LayerNorm forward

Difficulty: Easy

Background

LayerNorm re-establishes zero-mean, unit-variance activations at the start of every sub-layer in a transformer. Without it, activation magnitudes in the residual stream tend to grow as blocks stack, which can saturate attention and make training diverge. Unlike BatchNorm, LayerNorm normalises per token, over the feature axis only: each token's C-dim vector is normalised on its own, with no dependence on other tokens or batch elements. That batch independence is a key reason transformers use it: it behaves identically whether you process one token or a thousand.

Problem statement

Implement layer_norm(x, gamma, beta, eps=1e-5). For each vector along the last axis (the feature dim of size C):

$$\mu = \frac{1}{C}\sum_{c=1}^{C} x_c, \qquad \sigma^2 = \frac{1}{C}\sum_{c=1}^{C} (x_c - \mu)^2$$

$$\hat{x} = \frac{x - \mu}{\sqrt{\sigma^2 + \epsilon}}, \qquad \text{out} = \gamma \odot \hat{x} + \beta$$

gamma and beta are learnable (C,) parameters (initialised to 1 and 0, so the scale-and-shift step starts as the identity and LN initially returns the plain normalised vector $\hat{x}$).
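The formulas above translate almost line for line into NumPy. The following is a minimal sketch, not necessarily the reference solution the tests were written against; the cast to float64 is an assumption made here for simplicity.

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    """Normalise each vector along the last axis, then scale and shift."""
    x = np.asarray(x, dtype=np.float64)      # assumption: compute in float64
    mu = x.mean(axis=-1, keepdims=True)      # per-token mean, shape (..., 1)
    var = x.var(axis=-1, keepdims=True)      # biased variance (divide by C)
    x_hat = (x - mu) / np.sqrt(var + eps)    # eps goes inside the square root
    return gamma * x_hat + beta              # (C,) broadcasts over leading axes

x = np.array([[1.0, 2.0, 3.0, 4.0]])
out = layer_norm(x, np.ones(4), np.zeros(4))
print(out.shape)
```

Because the statistics keep a trailing axis of size 1, the same function handles inputs of shape (C,), (T, C), or (B, T, C) without any reshaping.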

Input

  • x: np.ndarray of shape (..., C); the last axis is the feature dimension.
  • gamma: (C,), the learnable per-feature scale $\gamma$.
  • beta: (C,), the learnable per-feature shift $\beta$.
  • eps: float, a small constant added inside the square root for numerical stability.

Output

Returns an np.ndarray of the same shape as x.

Examples

Example 1: normalising one token ($C=4$, $\gamma=1$, $\beta=0$)

Input:  x = [[1.0, 2.0, 3.0, 4.0]], gamma=ones(4), beta=zeros(4)
Output: ≈ [[-1.342, -0.447, 0.447, 1.342]]

Explanation: $\mu = 2.5$ and $\sigma^2 = \tfrac14(1.5^2 + 0.5^2 + 0.5^2 + 1.5^2) = 1.25$. Each entry becomes $(x - 2.5)/\sqrt{1.25}$, giving a vector with mean 0 and std 1 along the feature axis.
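The arithmetic in Example 1 can be reproduced step by step. This sketch uses the default eps of 1e-5, which shifts the result only beyond the third decimal place.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
mu = x.mean()                           # 2.5
var = ((x - mu) ** 2).mean()            # 1.25, biased: divide by C = 4
x_hat = (x - mu) / np.sqrt(var + 1e-5)  # eps inside the square root

# Rounded to three decimals: approximately [-1.342, -0.447, 0.447, 1.342]
print(np.round(x_hat, 3))
```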

Example 2: constant input collapses to $\beta$

Input:  x = np.full((2, 4, 8), 7.0),
        gamma = [1,2,3,4,5,6,7,8], beta = [10,20,30,40,50,60,70,80]
Output: every token = beta = [10,20,30,40,50,60,70,80]

Explanation: when all features are equal the variance is 0 and the numerator $x - \mu$ is exactly 0, so $\hat{x} = 0$ and the output is $\gamma\cdot 0 + \beta = \beta$. (The eps inside the sqrt keeps the division finite.)
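To see why eps matters in Example 2, the sketch below runs the same steps with and without it. The variable no_eps is introduced here purely for illustration; it is not part of the problem's API.

```python
import numpy as np

x = np.full((2, 4, 8), 7.0)
gamma = np.arange(1, 9, dtype=np.float64)    # [1, 2, ..., 8]
beta = 10.0 * np.arange(1, 9)                # [10, 20, ..., 80]
eps = 1e-5

mu = x.mean(axis=-1, keepdims=True)
var = x.var(axis=-1, keepdims=True)          # exactly 0 for constant input
x_hat = (x - mu) / np.sqrt(var + eps)        # 0 / sqrt(eps) = 0
out = gamma * x_hat + beta                   # every token equals beta

# Without eps the same step is 0 / 0, which NumPy evaluates to nan.
with np.errstate(invalid="ignore"):
    no_eps = (x - mu) / np.sqrt(var)

print(out[0, 0])
print(np.isnan(no_eps).all())
```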

Constraints

  • Normalise over the last axis only (axis=-1, per token), not across the batch (that would be BatchNorm). Use keepdims=True so $\mu, \sigma^2$ broadcast back.
  • Use the biased variance (divide by $C$, i.e. np.var's default).
  • Add eps inside the square root: $\sqrt{\sigma^2 + \epsilon}$. Adding it outside changes the math when $\sigma^2 > 0$.
  • With gamma=1, beta=0, each token's output has mean $\approx 0$, std $\approx 1$ regardless of input scale; tests use atol from 1e-6 (mean) to 1e-3 (std).
  • Must stay finite for large-magnitude inputs (e.g. $10^6$).

Notes

  • LayerNorm vs BatchNorm. Same z-score-then-scale-shift recipe, different axis: LayerNorm over features per token, BatchNorm over the batch per feature (see BatchNorm forward). LayerNorm's lack of batch dependence is what makes it the right choice for autoregressive generation.
  • Series. Step 7 of build-gpt; each transformer block applies LayerNorm before its attention and MLP sub-layers (pre-norm).
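The axis difference described in the notes can be made concrete with a single shared helper. In this sketch zscore is a hypothetical function introduced for illustration, and the "BatchNorm-style" branch uses only per-batch statistics (no running averages, no learnable parameters).

```python
import numpy as np

def zscore(x, axis, eps=1e-5):
    # Hypothetical helper: the shared z-score recipe; only the axis differs.
    mu = x.mean(axis=axis, keepdims=True)
    var = x.var(axis=axis, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

rng = np.random.default_rng(0)           # arbitrary seed
x = rng.standard_normal((4, 8))          # (batch, C)

ln_full = zscore(x, axis=-1)             # LayerNorm: over features, per token
bn_full = zscore(x, axis=0)              # BatchNorm-style: over the batch, per feature

# Normalise the first row alone, as a batch of one.
ln_single = zscore(x[:1], axis=-1)
bn_single = zscore(x[:1], axis=0)

# LayerNorm gives the same answer for the row either way;
# batch statistics do not (a batch of one normalises to all zeros).
print(np.allclose(ln_single, ln_full[:1]))
print(np.allclose(bn_single, bn_full[:1]))
```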

This problem ships 5 hidden tests. They run in your browser via Pyodide — no backend, no submission queue. Press ▶ Run tests to execute.

  • Output shape matches input
  • Diagnostic: with gamma=1, beta=0, each token has mean 0 and std 1 along the last axis
  • gamma scales, beta shifts
  • Constant input -> output is just beta (zero variance after eps)
  • Numerically stable for large-magnitude inputs