Softmax (multinomial) regression (difficulty: Medium)

Background

Softmax regression (multinomial logistic regression) generalises logistic regression from 2 classes to $C$ classes. It scores every class with a linear function, the softmax turns those scores into a probability distribution over the classes, and training minimises the multi-class cross-entropy by gradient descent. This is the same final linear layer and loss used by almost every neural-network classifier.

Problem statement

Implement train_softmaxreg(X, y, learning_rate, iterations), which trains softmax regression by gradient descent and returns the learned parameters together with the cross-entropy loss at each step. One-hot encode the integer labels into $Y$, prepend a bias column to $X$, start from $B = 0$, and at each iteration:

$$P = \operatorname{softmax}(XB), \qquad \operatorname{softmax}(z)_c = \frac{e^{z_c}}{\sum_{k} e^{z_k}}$$

$$B \leftarrow B - \alpha\, X^\top (P - Y)$$

$$L = -\sum_{i} \log P_{i,\, y_i}$$
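The three formulas above can be sketched on a toy problem. This is a minimal illustration, not the reference solution; the helper name softmax_rows and the two-sample data are assumptions made for the example.

```python
import numpy as np

def softmax_rows(Z):
    """Row-wise softmax: each row of Z becomes a probability vector."""
    E = np.exp(Z)
    return E / E.sum(axis=1, keepdims=True)

# Hypothetical toy data: 2 samples, bias column already prepended, 2 classes.
X = np.array([[1.0, 2.0],
              [1.0, -1.0]])
y = np.array([0, 1])
Y = np.array([[1.0, 0.0],
              [0.0, 1.0]])          # one-hot version of y
B = np.zeros((2, 2))                # shape (M+1, C)
alpha = 0.1

P = softmax_rows(X @ B)                       # every entry is 0.5 at B = 0
L = -np.log(P[np.arange(len(y)), y]).sum()    # summed cross-entropy, 2 * ln 2 here
B = B - alpha * (X.T @ (P - Y))               # one gradient step
```

Note that L is evaluated from the same P that drives the update, so it describes the parameters before the step is taken.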

Input

  • X (np.ndarray of shape (N, M)): N samples with M features (no bias column; the function prepends one).
  • y (np.ndarray of shape (N,)): integer class labels in $\{0, 1, \dots, C-1\}$ (classes start at 0).
  • learning_rate (float): the step size $\alpha$.
  • iterations (int): the number of full-batch gradient steps.
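As a minimal sketch of the input handling, assuming NumPy and reusing the data from Example 1, the labels can be one-hot encoded and the bias column prepended as follows. The names N, C, Y and Xb are illustrative choices, not part of the required interface.

```python
import numpy as np

X = np.array([[0.5, -1.2],
              [-0.3, 1.1],
              [0.8, -0.6]])
y = np.array([0, 1, 2])

N = X.shape[0]
C = int(y.max()) + 1                   # number of classes; labels start at 0

Y = np.eye(C)[y]                       # one-hot labels, shape (N, C)
Xb = np.hstack([np.ones((N, 1)), X])   # bias column first, shape (N, M+1)
```

Indexing the identity matrix with y picks row y[i] for sample i, which is exactly the one-hot vector for that label.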

Output

Returns (B, losses):

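  • B (list[list[float]] of shape (C, M+1)): the parameter matrix transposed (one row per class, bias first), rounded to 4 decimals.
  • losses (list[float] of length iterations): the summed cross-entropy recorded at each iteration, rounded to 4 decimals. Each value is computed from that iteration's P, before the update is applied, so losses[0] is the loss at $B = 0$.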
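The return format can be produced from NumPy arrays along the following lines. This is a sketch under the assumption that the parameters are stored internally with shape (M+1, C); the values and the names B_internal and loss_history are made up for illustration.

```python
import numpy as np

# Hypothetical internal state: M+1 = 3 rows (bias first), C = 2 columns.
B_internal = np.array([[0.123456, -0.123456],
                       [1.00004, -1.00004],
                       [0.5, -0.5]])
loss_history = [1.386294, 1.25]

# Transpose to (C, M+1), round, and convert to plain nested lists.
B_out = np.round(B_internal.T, 4).tolist()
losses_out = [round(float(v), 4) for v in loss_history]
```

Converting with tolist() and float() yields built-in Python types, which matches the list[list[float]] and list[float] types named above.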

Examples

Example 1

Input:  X = [[0.5, -1.2], [-0.3, 1.1], [0.8, -0.6]], y = [0, 1, 2]
        learning_rate = 0.01, iterations = 10
Output: B = [[-0.0011, 0.0145, -0.0921],
             [ 0.0020, -0.0598, 0.1263],
             [-0.0009, 0.0453, -0.0342]]
        losses[0] = 3.2958, losses[-1] = 3.0110   (10 values, decreasing)

Explanation: with $B = 0$ every class probability is $1/3$, so the first loss is $-3\log(1/3) = 3\ln 3 \approx 3.2958$. Each step shifts probability mass toward the correct class, so the loss falls.
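The first-loss value can be checked numerically. The snippet below is a small sanity check, assuming NumPy, that builds the uniform probabilities directly instead of running the training loop.

```python
import math
import numpy as np

N, C = 3, 3
y = np.array([0, 1, 2])

P = np.full((N, C), 1.0 / C)       # softmax of all-zero scores is uniform
first_loss = float(-np.log(P[np.arange(N), y]).sum())

print(round(first_loss, 4), round(N * math.log(C), 4))  # 3.2958 3.2958
```

The same argument gives N * ln(C) for any N and C, because every sample contributes -log(1/C) when the scores are all zero.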

Constraints

  • One-hot encode y with C = y.max() + 1 (classes start at 0); prepend a ones column to X; initialise B = 0.
  • Take the softmax per row (over classes); use the summed cross-entropy and the gradient $X^\top(P - Y)$.
  • Return B.T (shape (C, M+1)) and all losses rounded to 4 decimals.
  • Tests compare with atol=1e-3.
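Putting the constraints together, one possible implementation looks like the sketch below. It is not the reference solution: the internal names (Xb, Y, P) are illustrative, and recording the loss before each update is an assumption chosen because it makes the first loss equal the value at $B = 0$, as in Example 1.

```python
import numpy as np

def train_softmaxreg(X, y, learning_rate, iterations):
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=int)
    N = X.shape[0]
    C = int(y.max()) + 1

    Y = np.eye(C)[y]                          # one-hot labels, (N, C)
    Xb = np.hstack([np.ones((N, 1)), X])      # prepend bias column, (N, M+1)
    B = np.zeros((Xb.shape[1], C))            # zero initialisation, (M+1, C)

    losses = []
    for _ in range(iterations):
        E = np.exp(Xb @ B)                    # raw softmax, per the Notes
        P = E / E.sum(axis=1, keepdims=True)  # normalise each row over classes
        # Summed cross-entropy from this iteration's P (before the update).
        losses.append(float(-np.log(P[np.arange(N), y]).sum()))
        B -= learning_rate * (Xb.T @ (P - Y))

    # Return B transposed to (C, M+1); everything rounded to 4 decimals.
    return np.round(B.T, 4).tolist(), [round(v, 4) for v in losses]
```

The inputs from Example 1 can be passed directly, for instance train_softmaxreg([[0.5, -1.2], [-0.3, 1.1], [0.8, -0.6]], [0, 1, 2], 0.01, 10), and the result compared against the listed output within the stated tolerance.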

Notes

  • The softmax cross-entropy gradient collapses to the same residual form $X^\top(P - Y)$ as in linear and logistic regression: the transposed inputs times (prediction minus target). This recurring identity keeps the update simple across all three models.
  • Numerical stability: a production softmax subtracts $\max_c z_c$ from each row before exponentiating to avoid overflow; with zero initialisation and few steps the raw version is safe.
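The max-subtraction trick mentioned above can be sketched as follows. The function name softmax_stable and the test values are assumptions for illustration; the shift leaves the result unchanged because the common factor cancels between numerator and denominator.

```python
import numpy as np

def softmax_stable(Z):
    """Row-wise softmax with the row maximum subtracted before exponentiating."""
    Z = np.asarray(Z, dtype=float)
    shifted = Z - Z.max(axis=1, keepdims=True)   # largest entry per row becomes 0
    E = np.exp(shifted)                          # all exponents are <= 0, so no overflow
    return E / E.sum(axis=1, keepdims=True)

Z_small = np.array([[1.0, 2.0, 3.0]])
Z_large = Z_small + 1000.0    # np.exp(1001.0) on its own would overflow to inf

P_small = softmax_stable(Z_small)
P_large = softmax_stable(Z_large)   # same probabilities as P_small
```

Adding a constant to every score in a row does not change the softmax, which is why both inputs give the same probabilities.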

This problem ships 4 hidden tests, which run in the browser via Pyodide (no backend, no submission queue):

  • Reproduces the reference: B matrix and loss trajectory
  • First loss equals N*ln(C) from uniform init
  • Cross-entropy decreases monotonically
  • B has shape (C, M+1): one row per class, bias included
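The hidden tests themselves are not shown, but three of the four descriptions above can be approximated locally. The helper below is a hypothetical self-check (the name check_result and its arguments are assumptions); it cannot cover the reference comparison, which needs the reference values.

```python
import math

def check_result(B, losses, N, C, M, atol=1e-3):
    """Hypothetical local checks mirroring the listed test descriptions."""
    # B has shape (C, M+1): one row per class, bias included.
    assert len(B) == C, "expected one row per class"
    assert all(len(row) == M + 1 for row in B), "expected M+1 entries per row"

    # First loss equals N * ln(C) when starting from the uniform (zero) init.
    assert abs(losses[0] - N * math.log(C)) <= atol, "first loss is not N*ln(C)"

    # Cross-entropy should not increase from one recorded value to the next.
    assert all(b <= a for a, b in zip(losses, losses[1:])), "loss increased"
    return True
```

It would be called on the (B, losses) pair returned by a solution, with N, C and M taken from the inputs, for example N=3, C=3, M=2 for Example 1.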