Cross-entropy gradient
Background
Cross-entropy is the standard training loss for classifiers and language models. Its gradient with respect to the logits is one of those rare derivations where a page of calculus collapses into a single line, softmax minus the one-hot target, and that one-liner shows up everywhere from logistic regression to the final layer of GPT-4. Implementing it (with a numerically stable softmax inside) is the canonical "do you actually understand backprop" check.
Problem statement
Implement cross_entropy_grad(logits, target) for a single example. With logits $z \in \mathbb{R}^V$, probabilities $p = \mathrm{softmax}(z)$, and true class $y$, the loss and its gradient are:
$$L = -\log p_y, \qquad \frac{\partial L}{\partial z} = p - \mathrm{onehot}(y)$$
So the gradient is $p$ with $1$ subtracted from the target coordinate. Use the max-subtraction trick inside the softmax so large logits don't overflow.
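The recipe above (stable softmax, then subtract 1 at the target index) can be sketched in a few lines of NumPy. This is a minimal sketch, not the problem's official solution: it assumes a 0-based integer target and casts the logits to float64, and the intermediate names (z, shifted, exp, p) are illustrative choices.

```python
import numpy as np

def cross_entropy_grad(logits: np.ndarray, target: int) -> np.ndarray:
    """Gradient of L = -log softmax(logits)[target] with respect to logits."""
    # Assumption: work in float64 regardless of the input dtype.
    z = np.asarray(logits, dtype=np.float64)
    # Max-subtraction trick: softmax is unchanged when the same constant is
    # subtracted from every logit, and after the shift every exponent is <= 0.
    shifted = z - z.max()
    exp = np.exp(shifted)
    p = exp / exp.sum()   # softmax probabilities (a fresh array)
    p[target] -= 1.0      # p - onehot(target)
    return p
```

Because the division creates a new array, the caller's logits are never modified; for example, cross_entropy_grad(np.array([0.0, 0.0, 0.0]), 1) gives approximately [0.333, -0.667, 0.333].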
Input
- logits: 1-D np.ndarray of shape [V], the pre-softmax scores.
- target: int, the true class index in $\{0, \dots, V-1\}$.
Output
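Returns a 1-D np.ndarray of shape [V]: the gradient $\partial L / \partial z = p - \mathrm{onehot}(y)$, i.e. softmax(logits) minus the one-hot vector for target.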
Examples
Example 1 — uniform logits
Input: logits = [0.0, 0.0, 0.0], target = 1
Output: ≈ [0.333, -0.667, 0.333]
Explanation: equal logits give the uniform softmax $[1/3, 1/3, 1/3]$; subtracting the one-hot at index 1 makes the target coordinate $1/3 - 1 = -2/3$ (negative: push that logit up) while the others stay positive (push down).
Example 2 — the gradient sums to zero
Input: logits = [1.0, 2.0, 3.0, 0.5], target = 2
Output: softmax(logits) with index 2 decremented by 1; sum(output) ≈ 0
Explanation: the softmax sums to 1 and the one-hot sums to 1, so the difference sums to 0 (up to floating-point rounding), a cheap invariant to sanity-check your implementation.
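Example 2 can be worked through numerically. The sketch below computes the softmax inline rather than calling a finished cross_entropy_grad, so it runs on its own; the 1e-12 tolerance on the sum is an assumed threshold for float64 rounding, not something the problem specifies.

```python
import numpy as np

logits = np.array([1.0, 2.0, 3.0, 0.5])
target = 2

exp = np.exp(logits - logits.max())   # stable softmax numerator
grad = exp / exp.sum()                # softmax probabilities
grad[target] -= 1.0                   # softmax minus one-hot

print(np.round(grad, 3))              # approximately [0.085, 0.232, -0.369, 0.052]
print(abs(grad.sum()) < 1e-12)        # the sum-to-zero invariant
```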
Constraints
- Compute a numerically stable softmax (subtract logits.max() before exp); otherwise large logits (e.g. [1000, 1001, 1002]) produce NaN/inf.
- Return $p - \mathrm{onehot}(y)$: copy the softmax, subtract 1 at the target index.
- The output sums to $0$; the target-class gradient is negative and all others are positive.
- Must match a finite-difference numerical gradient (atol ≈ 1e-3).
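The finite-difference constraint can be checked with central differences against the loss itself. The following is a sketch under stated assumptions: cross_entropy_loss and numerical_grad are hypothetical helper names, the step size eps=1e-5 is an arbitrary choice, and the hidden tests may use a different scheme.

```python
import numpy as np

def cross_entropy_loss(logits, target):
    # L = -z_y + logsumexp(z), with the max subtracted for stability.
    z = np.asarray(logits, dtype=np.float64)
    m = z.max()
    return -(z[target] - m) + np.log(np.exp(z - m).sum())

def cross_entropy_grad(logits, target):
    # Analytic gradient: stable softmax minus one-hot.
    z = np.asarray(logits, dtype=np.float64)
    exp = np.exp(z - z.max())
    grad = exp / exp.sum()
    grad[target] -= 1.0
    return grad

def numerical_grad(logits, target, eps=1e-5):
    # Central differences: (L(z + eps*e_i) - L(z - eps*e_i)) / (2*eps).
    z = np.asarray(logits, dtype=np.float64)
    grad = np.zeros_like(z)
    for i in range(z.size):
        step = np.zeros_like(z)
        step[i] = eps
        grad[i] = (cross_entropy_loss(z + step, target)
                   - cross_entropy_loss(z - step, target)) / (2 * eps)
    return grad

logits = np.array([1.0, 2.0, 3.0, 0.5])
analytic = cross_entropy_grad(logits, 2)
numeric = numerical_grad(logits, 2)
print(np.allclose(analytic, numeric, atol=1e-3))

# Large logits stay finite because of the max subtraction.
big = cross_entropy_grad(np.array([1000.0, 1001.0, 1002.0]), 0)
print(np.isfinite(big).all(), big[0] < 0)
```

Central differences are used instead of a one-sided difference because their truncation error shrinks with the square of eps, which leaves a wide margin under the 1e-3 tolerance.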
Notes
- Where the one-liner comes from. $L = -\log p_y = -z_y + \log \sum_j e^{z_j}$. Differentiating w.r.t. $z_i$ gives $p_i - 1$ when $i = y$ and $p_i$ when $i \neq y$; subtracting the one-hot encodes both cases at once.
- Reuse. The internal softmax is the same stable routine as the softmax from scratch problem; this gradient is what flows out of the loss into the rest of backprop.
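The intermediate steps behind the first note can be written out as follows. The notation is an assumption made here for illustration: $z$ for the logits, $p$ for the softmax probabilities, $y$ for the true class, and $\mathbb{1}[\cdot]$ for the indicator function.

$$
\begin{aligned}
\frac{\partial}{\partial z_i}\,(-z_y) &= -\mathbb{1}[i = y], \\
\frac{\partial}{\partial z_i}\,\log \sum_j e^{z_j} &= \frac{e^{z_i}}{\sum_j e^{z_j}} = p_i, \\
\frac{\partial L}{\partial z_i} &= p_i - \mathbb{1}[i = y].
\end{aligned}
$$

The max-subtraction trick leaves this result unchanged, because softmax is invariant to a constant shift $c$ of all logits:

$$
\frac{e^{z_i - c}}{\sum_j e^{z_j - c}} = \frac{e^{-c}\, e^{z_i}}{e^{-c} \sum_j e^{z_j}} = p_i .
$$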
This problem ships 6 hidden tests, which run in the browser via Pyodide (no backend, no submission queue):
- Output shape matches logits
- Matches the analytic formula softmax - one_hot
- Matches finite-difference numerical gradient
- Gradient sums to ~0 (probability constraint)
- Gradient on the target class is negative
- Numerically stable for large logits