Label-smoothed cross-entropy
Background
Label smoothing replaces the one-hot training target with a slightly softer distribution, to stop the model becoming over-confident. Without it, the loss rewards putting probability 1 on the correct class, which needs the logit to run off to $+\infty$, hurting calibration. Smoothing caps that reward: for smoothing strength $\alpha$ and $K$ classes, the model is "done" once it puts $1 - \alpha + \alpha/K$ on the right class. It is standard regularisation in vision (Inception-v3, ResNet) and sequence models (the Transformer paper, T5).
Problem statement
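Implement label_smoothing_loss(logits, target, alpha) for one example. With $K$ classes, true class $y$, and $p = \mathrm{softmax}(\text{logits})$, the smoothed target and loss (PyTorch convention) are:
$$q_i = (1 - \alpha)\,\mathbb{1}[i = y] + \frac{\alpha}{K}, \qquad \mathcal{L} = -\sum_{i=0}^{K-1} q_i \log p_i$$
which decomposes into a scaled CE term plus a uniform-prior penalty:
$$\mathcal{L} = (1 - \alpha)\,(-\log p_y) + \alpha \left(-\frac{1}{K} \sum_{i=0}^{K-1} \log p_i\right)$$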
Use a numerically-stable log_softmax (subtract logits.max() first).
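One possible solution is sketched below, using the scaled-CE-plus-uniform-penalty form described above. Treat it as a minimal sketch rather than the reference answer: the helper name log_softmax is an assumption, as is the cast to float64.

```python
import numpy as np


def log_softmax(logits: np.ndarray) -> np.ndarray:
    """Stable log-softmax: shift by the max so exp() never overflows."""
    shifted = logits - logits.max()
    return shifted - np.log(np.exp(shifted).sum())


def label_smoothing_loss(logits: np.ndarray, target: int, alpha: float) -> float:
    """Label-smoothed cross-entropy for one example (PyTorch convention)."""
    logp = log_softmax(np.asarray(logits, dtype=np.float64))
    ce = -logp[target]      # ordinary cross-entropy at the true class
    uniform = -logp.mean()  # cross-entropy against the uniform distribution
    return float((1.0 - alpha) * ce + alpha * uniform)
```

Working from the decomposition avoids building the smoothed target vector at all; the explicit form, minus the sum of q_i * logp_i, should give the same number up to rounding.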
Input
- logits: 1-D np.ndarray of shape (K,), the pre-softmax scores for one example.
- target: int, the true class index in $[0, K)$.
- alpha: float in $[0, 1]$, the smoothing strength ($\alpha = 0$ is plain cross-entropy).
Output
Returns a float: the scalar loss.
Examples
Example 1: $\alpha = 0$ is plain cross-entropy
Input: logits = [1.0, 2.0, 3.0, 0.5], target = 2, alpha = 0
Output: -log_softmax(logits)[2]
Explanation: with $\alpha = 0$ the smoothed target is the ordinary one-hot, so the loss reduces to cross-entropy at the target class.
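Example 1 can be checked by hand with a few lines of NumPy. The snippet below is only an illustrative sketch: it builds the smoothed target explicitly (PyTorch convention) and compares the result with plain cross-entropy. The variable names are assumptions.

```python
import numpy as np

logits = np.array([1.0, 2.0, 3.0, 0.5])
target, alpha = 2, 0.0
K = logits.size

# Stable log-softmax: subtract the max before exponentiating.
shifted = logits - logits.max()
logp = shifted - np.log(np.exp(shifted).sum())

# Smoothed target; with alpha = 0 it is exactly one-hot.
q = np.full(K, alpha / K)
q[target] += 1.0 - alpha

smoothed_loss = float(-(q * logp).sum())
plain_ce = float(-logp[target])

print(q)              # [0. 0. 1. 0.]
print(smoothed_loss)  # about 0.4608, the same as plain_ce
```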
Example 2: smoothing penalises a perfectly-confident prediction
Input: logits whose softmax is ≈ one-hot at index 2 (e.g. [0, 0, 100, 0]), target = 2
alpha = 0 -> loss ≈ 0
alpha = 0.1 -> loss > 0
Explanation: a near-perfect prediction has plain CE $\approx 0$, but label smoothing places $\alpha/K$ mass on every class, so the model is penalised for not spreading some probability there. The loss is strictly positive and grows with $\alpha$.
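To make Example 2 concrete, the sketch below evaluates the loss on the logits [0, 0, 100, 0] for a few smoothing strengths. The compact label_smoothing_loss here is one possible implementation, and the list of alpha values is an arbitrary choice for illustration.

```python
import numpy as np


def label_smoothing_loss(logits, target, alpha):
    z = np.asarray(logits, dtype=np.float64)
    shifted = z - z.max()                            # stable log-softmax
    logp = shifted - np.log(np.exp(shifted).sum())
    return float((1.0 - alpha) * -logp[target] + alpha * -logp.mean())


confident = [0.0, 0.0, 100.0, 0.0]
alphas = [0.0, 0.05, 0.1, 0.2]
losses = [label_smoothing_loss(confident, 2, a) for a in alphas]

# Roughly 0, 3.75, 7.5 and 15: the loss grows linearly with alpha here.
for a, loss in zip(alphas, losses):
    print(f"alpha={a:<4} loss={loss:.4f}")
```

For these logits the log-probabilities are about [-100, -100, 0, -100], so the uniform term is 75 and the loss is about 75 times alpha.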
Constraints
- Use a stable log_softmax (subtract the max before exp), or large logits like [1000, 1001, 1002] overflow.
- The smoothed target sums to 1: the true class gets $1 - \alpha + \alpha/K$, every other class $\alpha/K$.
- $\alpha = 0$ exactly equals plain cross-entropy; the loss increases monotonically with $\alpha$ on a confident-correct example.
- Returns a scalar float.
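The first two constraints can be seen directly in a short experiment. The sketch below compares a naive log-softmax with the max-subtracted one on [1000, 1001, 1002], then checks that the smoothed target is a valid distribution. The choices target = 2 and alpha = 0.1 are arbitrary.

```python
import numpy as np

logits = np.array([1000.0, 1001.0, 1002.0])

# Naive version: exp(1000) overflows to inf, so every log-probability is -inf.
with np.errstate(over="ignore"):
    naive = logits - np.log(np.exp(logits).sum())

# Stable version: subtracting the max leaves exponents of at most 0.
shifted = logits - logits.max()
stable = shifted - np.log(np.exp(shifted).sum())

# Smoothed target: alpha/K everywhere, plus 1 - alpha on the true class.
K, target, alpha = logits.size, 2, 0.1
q = np.full(K, alpha / K)
q[target] += 1.0 - alpha

print(naive)    # all -inf
print(stable)   # finite, roughly [-2.41, -1.41, -0.41]
print(q.sum())  # 1 up to rounding
```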
Notes
- Why it helps. Capping the target probability prevents runaway logits, which improves calibration and is reported to give roughly a 1% top-1 accuracy gain at ImageNet scale, with a smoother loss landscape near the optimum. Typical $\alpha = 0.1$.
- Related. It is cross-entropy against a softened target — see cross-entropy gradient and KL divergence for the underlying machinery.
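The link to KL divergence and the cross-entropy gradient can be checked numerically. The sketch below assumes the PyTorch-convention smoothed target q and verifies two standard identities: the loss equals the entropy of q plus KL(q || p), and its gradient with respect to the logits is softmax(logits) - q. All helper names are hypothetical.

```python
import numpy as np


def log_softmax(z):
    shifted = z - z.max()
    return shifted - np.log(np.exp(shifted).sum())


def smoothed_target(K, target, alpha):
    q = np.full(K, alpha / K)
    q[target] += 1.0 - alpha
    return q


def loss_fn(z, q):
    return float(-(q * log_softmax(z)).sum())


z = np.array([1.0, 2.0, 3.0, 0.5])
q = smoothed_target(z.size, target=2, alpha=0.1)
p = np.exp(log_softmax(z))

# Cross-entropy against q = entropy of q + KL(q || p).
entropy_q = float(-(q * np.log(q)).sum())
kl_qp = float((q * (np.log(q) - np.log(p))).sum())
loss = loss_fn(z, q)

# Analytic gradient w.r.t. the logits, checked by central differences.
grad_analytic = p - q
eps = 1e-6
grad_numeric = np.array([
    (loss_fn(z + eps * e, q) - loss_fn(z - eps * e, q)) / (2 * eps)
    for e in np.eye(z.size)
])
```

Since the KL term is non-negative and vanishes only when p equals q, the loss bottoms out at the entropy of q. This is why a perfectly confident prediction is no longer optimal once alpha is positive.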
This problem ships 5 hidden tests. They run in your browser via Pyodide, with no backend and no submission queue.
- alpha=0 reduces to plain cross-entropy
- Diagnostic: matches the explicit smoothed formula
- Smoothing INCREASES loss for a perfectly-confident correct prediction
- Larger alpha gives larger loss (more regularisation)
- Numerically stable for large logits