Divide dataset by feature threshold
Background
Splitting a dataset on a single feature is the atomic operation inside every decision tree: choose a feature and a threshold, then partition the rows into those that pass and those that fail. Supporting both numeric thresholds (compare with $\geq$) and categorical thresholds (compare with $=$) lets one routine drive both numeric and categorical splits.
Problem statement
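Implement divide_on_feature(X, feature_i, threshold) that partitions the rows of X into two subsets using column feature_i:
$$X_1 = \{\, x \in X : \text{cond}(x[\text{feature\_i}]) \,\}, \qquad X_2 = X \setminus X_1$$
where the condition is $x[\text{feature\_i}] \geq \text{threshold}$ when threshold is numeric, and $x[\text{feature\_i}] = \text{threshold}$ when it is non-numeric.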
Input
- X: np.ndarray of shape (n_samples, n_features).
- feature_i: int, index of the column to split on.
- threshold: a numeric or categorical value defining the split.
Output
Returns a list [X_1, X_2] of two numpy arrays: X_1 holds the rows satisfying the condition, X_2 holds the rest. Either subset may be empty.
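One way to meet this specification is a boolean mask over the chosen column. The sketch below is a minimal illustration rather than a reference solution. It assumes X is a 2-D NumPy array, and the isinstance check against int, float, np.integer, and np.floating is an assumption about what counts as "numeric".

```python
import numpy as np


def divide_on_feature(X, feature_i, threshold):
    """Split the rows of X on column feature_i (illustrative sketch)."""
    X = np.asarray(X)
    column = X[:, feature_i]

    # Assumption: "numeric" means a Python or NumPy int/float scalar.
    if isinstance(threshold, (int, float, np.integer, np.floating)):
        mask = column >= threshold
    else:
        mask = column == threshold

    # Boolean indexing keeps the original row order within each subset.
    return [X[mask], X[~mask]]


X = np.array([[1, 2], [3, 4], [5, 6], [7, 8], [9, 10]])
X_1, X_2 = divide_on_feature(X, 0, 5)
print(X_1.tolist())  # [[5, 6], [7, 8], [9, 10]]
print(X_2.tolist())  # [[1, 2], [3, 4]]
```

When no row satisfies the condition, X[mask] is an array of shape (0, n_features), so the "either subset may be empty" case needs no special handling in this version.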
Examples
Example 1
Input: X = [[1, 2], [3, 4], [5, 6], [7, 8], [9, 10]], feature_i = 0, threshold = 5
Output: [[[5, 6], [7, 8], [9, 10]], [[1, 2], [3, 4]]]
Explanation: column 0 values are $[1, 3, 5, 7, 9]$. Rows with value $\geq 5$ (i.e. $5, 7, 9$) form X_1; the rest ($1, 3$) form X_2.
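The same example can be traced step by step with plain NumPy operations. This is only a sketch of the intermediate values; the variable names column and mask are chosen here for illustration.

```python
import numpy as np

X = np.array([[1, 2], [3, 4], [5, 6], [7, 8], [9, 10]])

column = X[:, 0]        # values of feature 0
mask = column >= 5      # numeric threshold, so compare with >=

X_1 = X[mask]           # rows that satisfy the condition
X_2 = X[~mask]          # the remaining rows

print(column.tolist())  # [1, 3, 5, 7, 9]
print(mask.tolist())    # [False, False, True, True, True]
print(X_1.tolist())     # [[5, 6], [7, 8], [9, 10]]
print(X_2.tolist())     # [[1, 2], [3, 4]]
```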
Constraints
- Numeric threshold → use >=; non-numeric threshold → use ==.
- Preserve the original row order within each subset.
- Return exactly two arrays as [X_1, X_2]; either may be empty.
Notes
- This is the workhorse behind tree node splits: pair it with an impurity criterion (Gini / entropy) and you have a complete decision-tree split.
- Using >= (rather than >) means a sample exactly equal to the threshold lands in the first subset.
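The pairing with an impurity criterion mentioned in the notes can be sketched as follows. This is an illustration under stated assumptions, not part of the problem: the helpers gini and weighted_gini_after_split are hypothetical names, and storing the class label in the last column of the array is a convention assumed here for convenience.

```python
import numpy as np


def divide_on_feature(X, feature_i, threshold):
    # Compact version of the split described in this problem.
    column = X[:, feature_i]
    numeric = isinstance(threshold, (int, float, np.integer, np.floating))
    mask = (column >= threshold) if numeric else (column == threshold)
    return [X[mask], X[~mask]]


def gini(labels):
    # Gini impurity: 1 minus the sum of squared class proportions.
    if len(labels) == 0:
        return 0.0
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(1.0 - np.sum(p ** 2))


def weighted_gini_after_split(Xy, feature_i, threshold):
    # Assumption: the class label is stored in the last column of Xy.
    first, second = divide_on_feature(Xy, feature_i, threshold)
    n = len(Xy)
    return (len(first) / n) * gini(first[:, -1]) + (len(second) / n) * gini(second[:, -1])


# Column 0 is the feature, column 1 is the label.
Xy = np.array([[1, 0], [3, 0], [5, 1], [7, 1], [9, 1]])

# Score a few candidate thresholds and keep the one with the lowest impurity.
scores = {t: weighted_gini_after_split(Xy, 0, t) for t in (3, 5, 7)}
best = min(scores, key=scores.get)
print(best)  # 5
```

On this toy data the threshold 5 separates the two classes perfectly, so its weighted impurity is 0 and it is selected.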
This problem ships 4 hidden tests, which run in the browser via Pyodide with no backend and no submission queue:
- Numeric split: rows with feature >= threshold go first
- Partition is complete (subset sizes sum to n)
- Categorical (string) threshold uses equality
- Threshold above all values yields an empty first subset
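The four checks listed above could look roughly like the following. Since the actual tests are hidden, these assertions are guesses at what such tests might verify, and the sample arrays X_num and X_cat are made up for illustration. The function under test is a boolean-mask sketch, not an official solution.

```python
import numpy as np


def divide_on_feature(X, feature_i, threshold):
    # Sketch implementation under test (boolean-mask approach).
    column = X[:, feature_i]
    if isinstance(threshold, (int, float, np.integer, np.floating)):
        mask = column >= threshold
    else:
        mask = column == threshold
    return [X[mask], X[~mask]]


X_num = np.array([[1, 2], [3, 4], [5, 6], [7, 8], [9, 10]])
X_cat = np.array([["red", "a"], ["blue", "b"], ["red", "c"]])

# 1. Numeric split: rows with feature >= threshold go first.
first, second = divide_on_feature(X_num, 0, 5)
assert first[:, 0].min() >= 5 and second[:, 0].max() < 5

# 2. Partition is complete: subset sizes sum to n.
assert len(first) + len(second) == len(X_num)

# 3. Categorical (string) threshold uses equality, preserving row order.
reds, others = divide_on_feature(X_cat, 0, "red")
assert reds.tolist() == [["red", "a"], ["red", "c"]]
assert others.tolist() == [["blue", "b"]]

# 4. Threshold above all values yields an empty first subset.
empty, rest = divide_on_feature(X_num, 0, 100)
assert len(empty) == 0 and len(rest) == len(X_num)
```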