Best Gini-based split (decision tree)
Background
A decision tree grows by repeatedly splitting the data on one feature and threshold to make the two resulting groups as class-pure as possible. Gini impurity measures how mixed a group is: when every label agrees, up to for a 50/50 binary mix. The best split is the (feature, threshold) pair that minimises the size-weighted average Gini of its two children. This split-finding routine is the inner loop of CART and of every gradient-boosted tree.
Problem statement
Implement find_best_split(X, y) that returns the (feature_index, threshold) minimising the weighted Gini impurity of a binary split. For a group with positive-class fraction :
A threshold on feature sends samples with left and right; score it by the size-weighted child impurity:
Try every feature and every unique value of that feature as a threshold, and return the pair with the smallest .
Input
X—np.ndarrayof shape(n_samples, n_features): numeric features.y—np.ndarrayof shape(n_samples,): binary labels in .
Output
Returns a tuple (feature_index, threshold):
feature_index—int: the column to split on.threshold—float: samples with go left, go right.
Examples
Example 1
Input: X = [[2.5], [3.5], [1.0], [4.0]], y = [0, 1, 0, 1]
Output: (0, 2.5)
Explanation: splitting feature 0 at sends (labels ) left and (labels ) right — two perfectly pure leaves, so the weighted Gini is , the minimum possible.
Constraints
- Candidate thresholds are the unique values of each feature; a sample goes left iff .
- An empty child contributes Gini .
- On ties, return the first split (features ascending, thresholds ascending) that attains the minimum.
- A vectorised mask per candidate split is fine; you do not need to vectorise the threshold loop.
Notes
- Gini for a binary split equals — it peaks at when and is at , which is why pure children score .
- The weighting by child size matters: a tiny pure leaf must not outweigh a large impure one, so each child's Gini is scaled by its sample count before averaging.
This problem ships 4 hidden tests. They run in your browser via Pyodide — no backend, no submission queue. Press ▶ Run tests to execute.
- •Reference example: pure split at feature 0, threshold 2.5
- •Picks the separating feature, ignores a constant feature
- •Returns an integer feature index and a float threshold
- •Chosen split achieves the minimum weighted Gini (pure leaves -> 0)