Label-smoothed cross-entropy

Difficulty: Medium

Background

Label smoothing replaces the one-hot training target with a slightly softer distribution, to stop the model becoming over-confident. Without it, the loss rewards putting probability 1 on the correct class, which requires the logit to run off to $+\infty$ and hurts calibration. Smoothing caps that reward: the model is "done" once it puts $1-\alpha+\alpha/K$ on the right class. It is standard regularisation in vision (Inception-v3, ResNet) and sequence models (the Transformer paper, T5).
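A short worked argument for the cap, as a sketch: write $p = \text{softmax}(z)$ (notation introduced here), with $K$ classes, smoothing strength $\alpha > 0$ and true class $y$. The smoothed loss is a cross-entropy between the smoothed target and $p$, so by Gibbs' inequality it is minimised exactly when $p$ equals the smoothed target. The optimal gap between the true-class logit and any other logit $z_j$ is therefore finite:

$$
z_y - z_j = \log\frac{p_y}{p_j} = \log\frac{1-\alpha+\alpha/K}{\alpha/K} = \log\!\left(1 + \frac{K(1-\alpha)}{\alpha}\right), \qquad j \neq y
$$

For instance, $\alpha = 0.1$ and $K = 4$ give $\log 37 \approx 3.61$, whereas the gap grows without bound as $\alpha \to 0$.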

Problem statement

Implement label_smoothing_loss(logits, target, alpha) for one example. With $K$ classes and true class $y$ (the target index), the smoothed target and loss (PyTorch convention) are:

$$
\text{soft}[i] = (1-\alpha)\,\mathbf{1}[i = y] + \frac{\alpha}{K}, \qquad L = -\sum_i \text{soft}[i]\,\log\text{softmax}(z)[i]
$$

which decomposes into a scaled CE term plus a uniform-prior penalty:

$$
L = -(1-\alpha)\,\log\text{softmax}(z)[y] \;-\; \frac{\alpha}{K}\sum_i \log\text{softmax}(z)[i]
$$
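The decomposition follows by substituting the smoothed target into the loss and splitting the sum. A brief derivation, using $\ell_i$ as shorthand (introduced here, not in the original statement) for $\log\text{softmax}(z)[i]$:

$$
L = -\sum_i \left[(1-\alpha)\,\mathbf{1}[i = y] + \frac{\alpha}{K}\right]\ell_i
  = -(1-\alpha)\sum_i \mathbf{1}[i = y]\,\ell_i \;-\; \frac{\alpha}{K}\sum_i \ell_i
  = -(1-\alpha)\,\ell_y \;-\; \frac{\alpha}{K}\sum_i \ell_i
$$

Note that the uniform sum runs over all $K$ classes, including the true class $y$.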

Use a numerically stable log_softmax: subtract logits.max() from the logits before exponentiating.
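A minimal sketch of one way to satisfy the statement, assuming NumPy and building the smoothed target vector explicitly. It is an illustration under those assumptions, not necessarily the reference solution behind the hidden tests.

```python
import numpy as np

def log_softmax(logits):
    """Numerically stable log-softmax of a 1-D array."""
    z = np.asarray(logits, dtype=float)
    shifted = z - z.max()            # largest entry becomes 0, so exp cannot overflow
    return shifted - np.log(np.exp(shifted).sum())

def label_smoothing_loss(logits, target, alpha):
    """Cross-entropy against the smoothed target, for one example."""
    log_probs = log_softmax(logits)
    K = log_probs.shape[0]
    soft = np.full(K, alpha / K)     # alpha/K on every class
    soft[target] += 1.0 - alpha      # true class ends up with 1 - alpha + alpha/K
    return float(-(soft * log_probs).sum())
```

Calling `label_smoothing_loss(np.array([1.0, 2.0, 3.0, 0.5]), 2, 0.0)` returns the plain cross-entropy at class 2, roughly 0.4608.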

Input

  • logits: 1-D np.ndarray of shape (K,), the pre-softmax scores for one example.
  • target: int, the true class index in $[0, K)$.
  • alpha: float in $[0, 1]$, the smoothing strength ($\alpha = 0$ is plain cross-entropy).

Output

Returns a float: the scalar loss.

Examples

Example 1: $\alpha = 0$ is plain cross-entropy

Input:  logits = [1.0, 2.0, 3.0, 0.5], target = 2, alpha = 0
Output: -log_softmax(logits)[2] ≈ 0.4608

Explanation: with $\alpha = 0$ the smoothed target is the ordinary one-hot, so the loss reduces to cross-entropy at the target class.

Example 2: smoothing penalises a perfectly-confident prediction

Input:  logits ≈ one-hot at index 2 (e.g. [0,0,100,0]), target = 2
        alpha = 0    -> loss ≈ 0
        alpha = 0.1  -> loss > 0

Explanation: a near-perfect prediction has plain CE $\approx 0$, but label smoothing places mass $\alpha/K$ on every class, so the model is penalised for not spreading some probability there. The loss is strictly positive and grows with $\alpha$.
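To make Example 2 concrete, the sketch below evaluates the two terms of the decomposition separately on the logits [0, 0, 100, 0]. The helper name `smoothed_loss_terms` is hypothetical and NumPy is assumed. Here log-softmax is about [-100, -100, 0, -100], so the CE term vanishes and the uniform term is about $(\alpha/K) \cdot 300$.

```python
import numpy as np

def smoothed_loss_terms(logits, target, alpha):
    """Return (scaled CE term, uniform-prior term); their sum is the loss."""
    z = np.asarray(logits, dtype=float)
    shifted = z - z.max()                              # stable log-softmax
    log_probs = shifted - np.log(np.exp(shifted).sum())
    ce_term = -(1.0 - alpha) * log_probs[target]       # scaled cross-entropy
    uniform_term = -(alpha / z.size) * log_probs.sum() # penalty over all K classes
    return float(ce_term), float(uniform_term)

logits = [0.0, 0.0, 100.0, 0.0]
# Total loss for a few smoothing strengths on the same confident prediction.
losses = {a: sum(smoothed_loss_terms(logits, 2, a)) for a in (0.0, 0.1, 0.3)}
# Roughly 0 for alpha=0, 7.5 for alpha=0.1, 22.5 for alpha=0.3.
```

The loss rises linearly with alpha here because the CE term is zero and only the uniform term contributes.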

Constraints

  • Use a stable log_softmax (subtract the max before exp), or large logits like [1000, 1001, 1002] overflow.
  • The smoothed target sums to 1: the true class gets $1-\alpha+\alpha/K$, every other class $\alpha/K$.
  • $\alpha = 0$ exactly equals plain cross-entropy; the loss increases monotonically with $\alpha$ on a confident-correct example.
  • Returns a scalar float.
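The first two constraints can be checked directly. The sketch below, assuming NumPy, compares a naive log-softmax (the name `naive_log_softmax` is hypothetical, written only to show the failure) with the max-shifted version on [1000, 1001, 1002], then builds a smoothed target and confirms it sums to 1.

```python
import numpy as np

def naive_log_softmax(z):
    # exp is applied to the raw logits, so exp(1000) overflows to inf
    return np.log(np.exp(z) / np.exp(z).sum())

def stable_log_softmax(z):
    shifted = z - z.max()            # shift so the largest logit is 0
    return shifted - np.log(np.exp(shifted).sum())

z = np.array([1000.0, 1001.0, 1002.0])
# Silence the overflow/invalid warnings NumPy emits for the naive version.
with np.errstate(over="ignore", invalid="ignore"):
    naive = naive_log_softmax(z)     # inf / inf gives nan in every entry
stable = stable_log_softmax(z)       # finite; equals log-softmax of [0, 1, 2]

# Smoothed target for K = 4 classes, true class 2, alpha = 0.1.
K, y, alpha = 4, 2, 0.1
soft = np.full(K, alpha / K)
soft[y] += 1.0 - alpha               # approximately [0.025, 0.025, 0.925, 0.025]
```

Log-softmax is invariant to adding a constant to every logit, which is why subtracting the maximum changes nothing except the numerical range of the exponentials.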

Notes

  • Why it helps. Capping the target probability prevents runaway logits, which improves calibration and is reported to improve top-1 accuracy by roughly 1% at ImageNet scale, with a smoother loss landscape near the optimum. A typical value is $\alpha = 0.1$.
  • Related. It is cross-entropy against a softened target — see cross-entropy gradient and KL divergence for the underlying machinery.

This problem ships 5 hidden tests:

  • alpha=0 reduces to plain cross-entropy
  • Diagnostic: matches the explicit smoothed formula
  • Smoothing INCREASES loss for a perfectly-confident correct prediction
  • Larger alpha gives larger loss (more regularisation)
  • Numerically stable for large logits