Cross-entropy gradient

Difficulty: Medium

Background

Cross-entropy is the standard training loss for classifiers and language models. Its gradient with respect to the logits is one of those rare derivations where a page of calculus collapses into a single line, softmax minus the one-hot target, and that one-liner shows up everywhere from logistic regression to the final layer of GPT-4. Implementing it (with a numerically stable softmax inside) is the canonical "do you actually understand backprop" check.

Problem statement

Implement cross_entropy_grad(logits, target) for a single example. With logits $z \in \mathbb{R}^V$, probabilities $p = \text{softmax}(z)$, and true class $y$, the loss and its gradient are:

$$L = -\log p_y, \qquad \frac{\partial L}{\partial z_i} = p_i - \mathbf{1}[i = y]$$

So the gradient is $\text{softmax}(z)$ with $1$ subtracted from the target coordinate. Use the max-subtraction trick inside the softmax so large logits don't overflow.

Input

  • logits (1-D np.ndarray of shape [V]): the pre-softmax scores.
  • target (int): the true class index in $[0, V)$.

Output

Returns a 1-D np.ndarray of shape [V]: $\partial L / \partial z$.
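One possible shape for the function is sketched below. It is a minimal sketch rather than a reference solution: it assumes NumPy, casts the input to float64 (a choice made here, not something the problem requires), and follows the recipe from the problem statement of a max-shifted softmax followed by subtracting 1 at the target index.

```python
import numpy as np

def cross_entropy_grad(logits, target):
    """Gradient of L = -log softmax(logits)[target] with respect to logits."""
    # Assumption: work in float64 so integer or float32 inputs behave the same.
    z = np.asarray(logits, dtype=np.float64)

    # Max-subtraction trick: softmax is unchanged by a constant shift,
    # and after the shift every exponent is <= 0, so exp() cannot overflow.
    shifted = z - z.max()
    exp = np.exp(shifted)
    grad = exp / exp.sum()   # this is softmax(z), a freshly allocated array

    # Subtract the one-hot target: only the target coordinate changes.
    grad[target] -= 1.0
    return grad
```

Because the division already allocates a new array, the subtraction can be done in place without touching the caller's logits.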

Examples

Example 1 — uniform logits

Input:  logits = [0.0, 0.0, 0.0], target = 1
Output: ≈ [0.333, -0.667, 0.333]

Explanation: equal logits give the uniform softmax $[\tfrac13, \tfrac13, \tfrac13]$; subtracting the one-hot at index 1 makes the target coordinate $\tfrac13 - 1 = -\tfrac23$ (negative, so gradient descent pushes that logit up) while the others stay positive (pushed down).

Example 2 — the gradient sums to zero

Input:  logits = [1.0, 2.0, 3.0, 0.5], target = 2
Output: softmax(logits) with index 2 decremented by 1;  sum(output) ≈ 0

Explanation: $\text{softmax}$ sums to 1 and the one-hot sums to 1, so the difference sums to 0 (exactly in theory, and to within floating-point rounding in practice). This is a cheap invariant to sanity-check your implementation.
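To make Example 2 concrete, the snippet below works the numbers out with NumPy. The helper name softmax is chosen here for illustration, and the rounded values in the comment are approximate.

```python
import numpy as np

def softmax(z):
    z = np.asarray(z, dtype=np.float64)
    e = np.exp(z - z.max())  # stable: the largest exponent is 0
    return e / e.sum()

logits = np.array([1.0, 2.0, 3.0, 0.5])
target = 2

grad = softmax(logits)
grad[target] -= 1.0  # subtract the one-hot target

print(np.round(grad, 4))  # approximately [0.0854, 0.2321, -0.3692, 0.0518]
print(grad.sum())         # 0 up to floating-point rounding
```

Only the target entry is negative, and the positive entries add up to its magnitude, which is why the sum vanishes.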

Constraints

  • Compute a numerically-stable softmax (subtract logits.max() before exp) — otherwise large logits (e.g. [1000, 1001, 1002]) produce NaN/inf.
  • Return $p - \text{one\_hot}(y)$: copy the softmax, subtract 1 at the target index.
  • The output sums to $\approx 0$; the target-class gradient is negative and all others are positive (for extreme logits, floating-point rounding can push these to exactly 0).
  • Must match a finite-difference numerical gradient (atol≈1e-3).
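The stability constraint can be seen directly on the logits [1000, 1001, 1002] mentioned above. The sketch below compares a naive softmax with the max-shifted one; the variable names are illustrative, and np.errstate is used only to silence the overflow warnings the naive version triggers.

```python
import numpy as np

logits = np.array([1000.0, 1001.0, 1002.0])

# Naive softmax: exp(1000) overflows float64 to inf, and inf / inf is nan.
with np.errstate(over="ignore", invalid="ignore"):
    naive = np.exp(logits) / np.exp(logits).sum()

# Stable softmax: shift by the max so the exponents become [-2, -1, 0].
shifted = logits - logits.max()
stable = np.exp(shifted) / np.exp(shifted).sum()

print(np.isnan(naive).all())  # True
print(np.round(stable, 4))    # approximately [0.09, 0.2447, 0.6652]
```

The shifted result equals softmax([-2, -1, 0]), which is what the unshifted formula would give in exact arithmetic.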

Notes

  • Where the one-liner comes from. $L = -\log p_y$. Differentiating w.r.t. $z_i$ gives $p_i - 1$ when $i = y$ and $p_i$ when $i \ne y$; subtracting the one-hot encodes both cases at once.
  • Reuse. The internal softmax is the same stable routine as the softmax from scratch problem; this gradient is what flows out of the loss into the rest of backprop.
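The derivation behind the first note can be written out in two lines. Expanding $p_y$ with the softmax definition turns the loss into a log-sum-exp, and differentiating that term is what produces $p_i$:

```latex
\begin{aligned}
L &= -\log p_y = -z_y + \log \sum_{j} e^{z_j}, \\
\frac{\partial L}{\partial z_i}
  &= -\mathbf{1}[i = y] + \frac{e^{z_i}}{\sum_{j} e^{z_j}}
   = p_i - \mathbf{1}[i = y].
\end{aligned}
```

The $-z_y$ term contributes $-1$ only at $i = y$, which is exactly the one-hot subtraction.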

This problem ships 6 hidden tests. They run in your browser via Pyodide, with no backend and no submission queue. The tests check:

  • Output shape matches logits
  • Matches the analytic formula softmax - one_hot
  • Matches finite-difference numerical gradient
  • Gradient sums to ~0 (probability constraint)
  • Gradient on the target class is negative
  • Numerically stable for large logits
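The hidden tests themselves are not shown here. As a rough local stand-in for the finite-difference check, the sketch below compares the analytic gradient against a central-difference estimate of the loss. The helpers softmax, loss, and numerical_grad, the step size eps, and the random test point are all assumptions made for illustration.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def loss(z, y):
    # L = -log p_y, computed as log-sum-exp with max subtraction.
    shifted = z - z.max()
    return -(shifted[y] - np.log(np.exp(shifted).sum()))

def numerical_grad(z, y, eps=1e-5):
    # Central differences: perturb one logit at a time by +/- eps.
    g = np.zeros_like(z)
    for i in range(z.size):
        step = np.zeros_like(z)
        step[i] = eps
        g[i] = (loss(z + step, y) - loss(z - step, y)) / (2 * eps)
    return g

rng = np.random.default_rng(0)
z = rng.normal(size=5)
y = 3

analytic = softmax(z)
analytic[y] -= 1.0
numeric = numerical_grad(z, y)

assert np.allclose(analytic, numeric, atol=1e-3)  # tolerance from the constraints
assert abs(analytic.sum()) < 1e-9                 # gradient sums to ~0
assert analytic[y] < 0                            # target-class gradient is negative
```

Central differences have error on the order of eps squared, so the agreement here is typically far tighter than the 1e-3 tolerance.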