Logistic regression: gradient descent

Difficulty: Medium

Background

Logistic regression is the workhorse binary classifier: it passes a linear score $X\theta$ through the sigmoid $\sigma(z) = 1/(1+e^{-z})$ to produce a probability, and it is trained by minimising binary cross-entropy (log loss). Unlike linear regression with its normal equation, there is no closed form, so the parameters are found by gradient descent. It underpins everything from click-through prediction to the final layer of a neural-network classifier.

Problem statement

Implement train_logreg(X, y, learning_rate, iterations), which trains logistic regression by gradient descent on the binary cross-entropy loss and returns the learned coefficients together with the loss recorded at every iteration. Prepend a bias column of ones to $X$, start from $\theta = 0$, and at each step use the sigmoid prediction and the BCE gradient:

$$\hat{y} = \sigma(X\theta), \qquad \sigma(z) = \frac{1}{1 + e^{-z}}$$

$$\theta \leftarrow \theta - \alpha\, X^\top (\hat{y} - y)$$

$$L = -\sum_{i} \left[\, y_i \log \hat{y}_i + (1 - y_i)\log(1 - \hat{y}_i) \,\right]$$
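The three formulas above can be read as a single full-batch update. The following is a minimal NumPy sketch rather than the reference solution: the helper names sigmoid and gradient_step are hypothetical, and X_b is assumed to already contain the bias column.

```python
import numpy as np

def sigmoid(z):
    """Logistic function 1 / (1 + exp(-z)), applied elementwise."""
    return 1.0 / (1.0 + np.exp(-z))

def gradient_step(X_b, y, theta, alpha):
    """One full-batch update. Returns (new_theta, loss at the old theta)."""
    y_hat = sigmoid(X_b @ theta)                          # predicted probabilities
    # Summed (not averaged) binary cross-entropy at the current theta.
    loss = -np.sum(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))
    new_theta = theta - alpha * (X_b.T @ (y_hat - y))     # BCE gradient step
    return new_theta, loss
```

Returning the loss evaluated at the old parameters is a design choice of this sketch; it keeps the loss and the gradient tied to the same predictions.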

Input

  • X: np.ndarray of shape (n_samples, n_features). The feature matrix (no bias column; the function prepends one).
  • y: np.ndarray of shape (n_samples,). Binary labels in $\{0, 1\}$.
  • learning_rate: float. The gradient-descent step size $\alpha$.
  • iterations: int. The number of full-batch gradient steps.

Output

Returns a tuple (coefficients, losses):

  • coefficients: list[float] of length n_features + 1 (bias first), each rounded to 4 decimals.
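  • losses: list[float] of length iterations. The total BCE loss recorded at each iteration, rounded to 4 decimals. Example 1 implies that each value is computed from that iteration's predictions before the update, so losses[0] is the loss at $\theta = 0$.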

Examples

Example 1

Input:  X = [[1.0, 0.5], [-0.5, -1.5], [2.0, 1.5], [-2.0, -1.0]]
        y = [1, 0, 1, 0], learning_rate = 0.01, iterations = 20
Output: coefficients = [-0.0003, 0.4038, 0.3379]
        losses[0] = 2.7726, falling monotonically over the 20 steps

Explanation: with $\theta = 0$ every prediction is $\sigma(0) = 0.5$, so the first loss is $-4\log(0.5) = 4\ln 2 \approx 2.7726$. Each gradient step raises $\hat{y}$ on the positives and lowers it on the negatives, so the bias stays near $0$ while the two feature weights grow to separate the classes and the loss falls monotonically.
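The arithmetic in this explanation can be checked by running only the first iteration on the Example 1 data. This is a small sketch with hypothetical variable names (X_b, first_loss, grad, theta_after), not part of the required function.

```python
import numpy as np

X = np.array([[1.0, 0.5], [-0.5, -1.5], [2.0, 1.5], [-2.0, -1.0]])
y = np.array([1.0, 0.0, 1.0, 0.0])

X_b = np.hstack([np.ones((X.shape[0], 1)), X])   # prepend the bias column
theta = np.zeros(X_b.shape[1])                   # zero initialisation

y_hat = 1.0 / (1.0 + np.exp(-(X_b @ theta)))     # every entry is 0.5 at theta = 0
first_loss = -np.sum(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))
grad = X_b.T @ (y_hat - y)                       # summed BCE gradient
theta_after = theta - 0.01 * grad                # one step with learning_rate = 0.01

print(round(float(first_loss), 4))   # 2.7726, i.e. 4 * ln 2
print(grad)                          # bias 0, features -2.75 and -2.25
print(theta_after)                   # bias 0, features 0.0275 and 0.0225
```

The bias component of the first gradient is exactly zero because the two classes are balanced, which is consistent with the bias staying near 0 in the reported coefficients.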

Constraints

  • Prepend a single bias column of ones to X; initialise $\theta$ to zeros.
  • Use the summed (not averaged) BCE loss, with gradient $X^\top(\hat{y} - y)$.
  • Round the coefficients and every loss value to 4 decimal places.
  • The learning_rate is small enough that the loss is non-increasing; tests compare with atol=1e-3.
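Taken together, the constraints suggest a training loop along the following lines. This is one possible sketch, not the site's reference solution. It assumes the loss is recorded before each update, since that is what makes losses[0] equal to $N \ln 2$ as in Example 1; the internal names X_b, theta and y_hat are illustrative.

```python
import numpy as np

def train_logreg(X, y, learning_rate, iterations):
    """Full-batch gradient descent on the summed BCE loss (sketch)."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    X_b = np.hstack([np.ones((X.shape[0], 1)), X])   # single bias column, first
    theta = np.zeros(X_b.shape[1])                   # zero initialisation

    losses = []
    for _ in range(iterations):
        y_hat = 1.0 / (1.0 + np.exp(-(X_b @ theta)))
        # Assumption: record the loss from this iteration's predictions,
        # before theta is updated.
        loss = -np.sum(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))
        losses.append(round(float(loss), 4))
        theta = theta - learning_rate * (X_b.T @ (y_hat - y))

    coefficients = [round(float(c), 4) for c in theta]
    return coefficients, losses
```

Called with the Example 1 inputs, this returns three coefficients (bias first) and twenty loss values, the first of which is 2.7726.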

Notes

  • The BCE gradient simplifies beautifully: despite the $\log\sigma$ terms in the loss, $\partial L / \partial\theta = X^\top(\hat{y} - y)$, the same clean form as linear regression's gradient. Together with the convexity of the loss in $\theta$, this is why logistic regression trains so stably.
  • Numerical care: $\log\hat{y}$ diverges if $\hat{y}$ reaches exactly $0$ or $1$. With zero init and few iterations this won't happen, but production code clips $\hat{y}$ into $[\varepsilon,\, 1-\varepsilon]$.
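Both notes can be illustrated with a short sketch. The first part compares the analytic gradient against a central finite difference of the loss; the second shows the clipped loss staying finite when a prediction saturates. The function names and the default eps=1e-12 are assumptions made for this example, not values taken from the problem.

```python
import numpy as np

def bce_loss(theta, X_b, y, eps=1e-12):
    """Summed BCE with predictions clipped into [eps, 1 - eps]."""
    y_hat = 1.0 / (1.0 + np.exp(-(X_b @ theta)))
    y_hat = np.clip(y_hat, eps, 1.0 - eps)
    return -np.sum(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

def analytic_grad(theta, X_b, y):
    """Closed-form gradient X^T (y_hat - y)."""
    y_hat = 1.0 / (1.0 + np.exp(-(X_b @ theta)))
    return X_b.T @ (y_hat - y)

def numeric_grad(theta, X_b, y, h=1e-6):
    """Central finite-difference approximation of the gradient."""
    grad = np.zeros_like(theta)
    for j in range(theta.size):
        step = np.zeros_like(theta)
        step[j] = h
        grad[j] = (bce_loss(theta + step, X_b, y)
                   - bce_loss(theta - step, X_b, y)) / (2 * h)
    return grad

# Example 1 data with the bias column already prepended.
X_b = np.array([[1.0, 1.0, 0.5], [1.0, -0.5, -1.5],
                [1.0, 2.0, 1.5], [1.0, -2.0, -1.0]])
y = np.array([1.0, 0.0, 1.0, 0.0])
theta = np.array([0.1, -0.2, 0.3])

grads_match = np.allclose(analytic_grad(theta, X_b, y),
                          numeric_grad(theta, X_b, y), atol=1e-5)
print(grads_match)   # True

# A large weight drives some predictions to exactly 1.0 in float64.
# Without the clip, the 0 * log(0) term would evaluate to nan.
saturated = np.array([0.0, 50.0, 0.0])
print(np.isfinite(bce_loss(saturated, X_b, y)))   # True
```

Clipping changes the loss only when a prediction lies within eps of 0 or 1, so for the small learning rates and iteration counts in this problem it would leave the reported values unchanged.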

This problem ships 4 hidden tests, which run in the browser via Pyodide:

  • Reproduces the reference run: coefficients and first loss
  • First loss equals N*ln2 from zero init (every prediction is 0.5)
  • BCE loss is monotonically non-increasing under gradient descent
  • Returns a bias term plus one weight per feature