Best Gini-based split (decision tree)Medium

Best Gini-based split (decision tree)

Background

A decision tree grows by repeatedly splitting the data on one feature and threshold to make the two resulting groups as class-pure as possible. Gini impurity measures how mixed a group is: 00 when every label agrees, up to 0.50.5 for a 50/50 binary mix. The best split is the (feature, threshold) pair that minimises the size-weighted average Gini of its two children. This split-finding routine is the inner loop of CART and of every gradient-boosted tree.

Problem statement

Implement find_best_split(X, y) that returns the (feature_index, threshold) minimising the weighted Gini impurity of a binary split. For a group with positive-class fraction pp:

Gini(p)=1(p2+(1p)2)\operatorname{Gini}(p) = 1 - \big(p^2 + (1-p)^2\big)

A threshold tt on feature ff sends samples with xftx_f \le t left and xf>tx_f > t right; score it by the size-weighted child impurity:

G(f,t)=nLGini(yL)+nRGini(yR)nG(f, t) = \frac{n_L\,\operatorname{Gini}(y_L) + n_R\,\operatorname{Gini}(y_R)}{n}

Try every feature and every unique value of that feature as a threshold, and return the pair with the smallest GG.

Input

  • Xnp.ndarray of shape (n_samples, n_features): numeric features.
  • ynp.ndarray of shape (n_samples,): binary labels in {0,1}\{0, 1\}.

Output

Returns a tuple (feature_index, threshold):

  • feature_indexint: the column to split on.
  • thresholdfloat: samples with xftx_f \le t go left, xf>tx_f > t go right.

Examples

Example 1

Input:  X = [[2.5], [3.5], [1.0], [4.0]], y = [0, 1, 0, 1]
Output: (0, 2.5)

Explanation: splitting feature 0 at 2.52.5 sends {1.0,2.5}\{1.0, 2.5\} (labels 0,00, 0) left and {3.5,4.0}\{3.5, 4.0\} (labels 1,11, 1) right — two perfectly pure leaves, so the weighted Gini is 00, the minimum possible.

Constraints

  • Candidate thresholds are the unique values of each feature; a sample goes left iff xftx_f \le t.
  • An empty child contributes Gini 00.
  • On ties, return the first split (features ascending, thresholds ascending) that attains the minimum.
  • A vectorised mask per candidate split is fine; you do not need to vectorise the threshold loop.

Notes

  • Gini for a binary split equals 2p(1p)2p(1-p) — it peaks at 0.50.5 when p=0.5p = 0.5 and is 00 at p{0,1}p \in \{0, 1\}, which is why pure children score 00.
  • The weighting by child size matters: a tiny pure leaf must not outweigh a large impure one, so each child's Gini is scaled by its sample count before averaging.
Python
Loading...

This problem ships 4 hidden tests. They run in your browser via Pyodide — no backend, no submission queue. Press ▶ Run tests to execute.

  • Reference example: pure split at feature 0, threshold 2.5
  • Picks the separating feature, ignores a constant feature
  • Returns an integer feature index and a float threshold
  • Chosen split achieves the minimum weighted Gini (pure leaves -> 0)