LayerNorm forward
Background
LayerNorm re-establishes zero-mean, unit-variance activations at the start of every sub-layer in a transformer. Without it, activation magnitudes in the residual stream tend to grow as blocks stack: attention saturates and training can diverge. Unlike BatchNorm, LayerNorm normalises per token, over the feature axis only: each token's C-dim vector is its own normalisation universe, with no dependence on other tokens or batch elements. That batch-independence is a key reason transformers use it: it behaves identically whether you process one token or a thousand.
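The batch-independence claim can be checked directly. The sketch below uses a hypothetical helper, normalise_last_axis (not part of the problem's API, and without gamma/beta), to z-score one token on its own and again as part of a larger batch; the shapes and the random seed are arbitrary choices.

```python
import numpy as np

def normalise_last_axis(x, eps=1e-5):
    # Hypothetical helper: z-score each vector along the last axis.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)   # biased variance (ddof=0)
    return (x - mean) / np.sqrt(var + eps)

rng = np.random.default_rng(0)
batch = rng.normal(size=(3, 5, 8))        # (batch, tokens, C)
token = batch[1, 2]                       # one token's C-dim vector

alone = normalise_last_axis(token)             # processed by itself
in_batch = normalise_last_axis(batch)[1, 2]    # processed alongside 14 others

same = np.allclose(alone, in_batch)
print(same)
```

The token's statistics come only from its own 8 features, so the two results agree.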
Problem statement
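Implement layer_norm(x, gamma, beta, eps=1e-5). For each vector $x$ along the last axis (the feature dim of size C):
$$\mu = \frac{1}{C}\sum_{i=1}^{C} x_i, \qquad \sigma^2 = \frac{1}{C}\sum_{i=1}^{C} (x_i - \mu)^2, \qquad y_i = \gamma_i \, \frac{x_i - \mu}{\sqrt{\sigma^2 + \epsilon}} + \beta_i$$
gamma and beta are learnable (C,) parameters (initialised to 1 and 0, so LN starts as a plain normalisation with no extra scale or shift).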
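One possible shape for the function, as a minimal sketch rather than the reference solution. It assumes NumPy inputs and converts everything with np.asarray(..., dtype=float), which is a choice made here and not something the problem requires.

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    # Assumption: accept array-likes and compute in float64.
    x = np.asarray(x, dtype=float)
    gamma = np.asarray(gamma, dtype=float)
    beta = np.asarray(beta, dtype=float)

    # Per-token statistics over the feature axis; keepdims lets them
    # broadcast back against x.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)       # biased: divides by C

    x_hat = (x - mean) / np.sqrt(var + eps)   # eps inside the square root
    return gamma * x_hat + beta               # per-feature scale and shift

out = layer_norm([[1.0, 2.0, 3.0, 4.0]], np.ones(4), np.zeros(4))
print(out.shape)
```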
Input
- x: np.ndarray of shape (..., C); the last axis is the feature dimension.
- gamma: (C,), the learnable per-feature scale $\gamma$.
- beta: (C,), the learnable per-feature shift $\beta$.
- eps: float, a small constant added inside the square root for numerical stability.
Output
Returns an np.ndarray of the same shape as x.
Examples
Example 1: normalising one token
Input: x = [[1.0, 2.0, 3.0, 4.0]], gamma=ones(4), beta=zeros(4)
Output: ≈ [[-1.342, -0.447, 0.447, 1.342]]
Explanation: $\mu = 2.5$ and $\sigma^2 = 1.25$. Each entry becomes $(x_i - 2.5)/\sqrt{1.25 + \epsilon}$, giving a vector with mean 0 and std approximately 1 along the feature axis.
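The arithmetic behind Example 1 can be written out step by step, assuming the default eps of 1e-5 (values rounded):

$$
\begin{aligned}
\mu &= \tfrac{1}{4}(1 + 2 + 3 + 4) = 2.5 \\
\sigma^2 &= \tfrac{1}{4}\left(1.5^2 + 0.5^2 + 0.5^2 + 1.5^2\right) = \tfrac{5}{4} = 1.25 \\
\sqrt{\sigma^2 + \epsilon} &= \sqrt{1.25001} \approx 1.11804 \\
\frac{x - \mu}{\sqrt{\sigma^2 + \epsilon}} &\approx \frac{(-1.5,\ -0.5,\ 0.5,\ 1.5)}{1.11804} \approx (-1.342,\ -0.447,\ 0.447,\ 1.342)
\end{aligned}
$$

With gamma = 1 and beta = 0 the scale-and-shift step leaves these values unchanged.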
Example 2: constant input collapses to beta
Input: x = np.full((2, 4, 8), 7.0),
gamma = [1,2,3,4,5,6,7,8], beta = [10,20,30,40,50,60,70,80]
Output: every token = beta = [10,20,30,40,50,60,70,80]
Explanation: when all features are equal the variance is 0 and the numerator $x_i - \mu$ is exactly 0, so the normalised value is 0 and the output is $\gamma \cdot 0 + \beta = \beta$. (The eps inside the sqrt keeps the division finite.)
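Example 2 can be reproduced inline. The sketch below spells out the steps with the default eps=1e-5 and, for contrast, repeats the division without eps to show why it matters; the variable names are illustrative only.

```python
import numpy as np

x = np.full((2, 4, 8), 7.0)
gamma = np.arange(1.0, 9.0)            # [1, 2, ..., 8]
beta = np.arange(10.0, 90.0, 10.0)     # [10, 20, ..., 80]
eps = 1e-5

mean = x.mean(axis=-1, keepdims=True)
var = x.var(axis=-1, keepdims=True)    # all zeros for a constant token
centred = x - mean                     # numerator: all zeros

out = gamma * centred / np.sqrt(var + eps) + beta

# Without eps the division is 0 / 0, which NumPy evaluates to nan.
with np.errstate(invalid="ignore", divide="ignore"):
    no_eps = gamma * centred / np.sqrt(var) + beta

print(out[0, 0])
print(np.isnan(no_eps).all())
```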
Constraints
- Normalise over the last axis only (axis=-1, per token), not across the batch (that would be BatchNorm). Use keepdims=True so the mean and variance broadcast back against x.
- Use the biased variance (divide by $C$, i.e. np.var's default).
- Add eps inside the square root: $\sqrt{\sigma^2 + \epsilon}$. Adding it outside changes the math when $\sigma^2$ is close to 0.
- With gamma=1, beta=0, each token's output has mean $\approx 0$, std $\approx 1$ regardless of input scale; tests use atol from 1e-6 (mean) to 1e-3 (std).
- Must stay finite for large-magnitude inputs.
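Two of these constraints are easy to get subtly wrong, so here is a small illustration on a nearly constant token. The input values are made up for the demonstration; the point is only that the biased and unbiased variances differ, and that moving eps outside the square root gives a noticeably different result when the variance is tiny.

```python
import numpy as np

x = np.array([[1.0, 1.0, 1.0, 1.001]])   # nearly constant token, C = 4
eps = 1e-5

mean = x.mean(axis=-1, keepdims=True)
var = x.var(axis=-1, keepdims=True)              # biased: divides by C
var_unbiased = x.var(axis=-1, keepdims=True, ddof=1)   # divides by C - 1

inside = (x - mean) / np.sqrt(var + eps)      # eps inside the square root
outside = (x - mean) / (np.sqrt(var) + eps)   # eps outside: a different formula

# How much larger the last entry becomes with eps outside the sqrt.
ratio = abs(outside[0, -1] / inside[0, -1])
print(ratio)
print(var_unbiased / var)
```

Here the variance is far smaller than eps, so eps dominates the denominator in the inside form but barely affects it in the outside form.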
Notes
- LayerNorm vs BatchNorm. Same z-score-then-scale-shift recipe, different axis: LayerNorm over features per token, BatchNorm over the batch per feature (see BatchNorm forward). LayerNorm's lack of batch dependence is what makes it the right choice for autoregressive generation.
- Series. Step 7 of build-gpt; each transformer block applies LayerNorm before its attention and MLP sub-layers (pre-norm).
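The axis difference between the two norms can be made concrete. The sketch below uses one hypothetical zscore helper and only changes the axis; the BatchNorm side is a simplification that uses batch statistics with no running averages and no gamma/beta. Perturbing a different batch element leaves row 0 unchanged under LayerNorm but not under the batch-axis version.

```python
import numpy as np

def zscore(x, axis, eps=1e-5):
    # Hypothetical helper: same recipe, caller picks the axis.
    mean = x.mean(axis=axis, keepdims=True)
    var = x.var(axis=axis, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 6))     # (batch, C)
x_mod = x.copy()
x_mod[3] += 100.0               # perturb a different batch element

# LayerNorm-style: statistics over features, one set per row.
ln_unchanged = np.allclose(zscore(x, -1)[0], zscore(x_mod, -1)[0])
# BatchNorm-style (simplified): statistics over the batch, one set per feature.
bn_unchanged = np.allclose(zscore(x, 0)[0], zscore(x_mod, 0)[0])

print(ln_unchanged, bn_unchanged)
```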
This problem ships 5 hidden tests, which run in the browser via Pyodide:
- Output shape matches input
- Diagnostic: with gamma=1, beta=0, each token has mean 0 and std 1 along the last axis
- gamma scales, beta shifts
- Constant input -> output is just beta (zero variance, kept finite by eps)
- Numerically stable for large-magnitude inputs
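The hidden tests themselves are not shown, so the following is only a guess at checks in the same spirit, useful for trying a solution locally. The layer_norm body is a minimal sketch, and the shapes, seed, and the 1e8 scale factor are assumptions chosen here.

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    # Minimal sketch of a candidate solution.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return gamma * (x - mean) / np.sqrt(var + eps) + beta

rng = np.random.default_rng(0)
C = 16
x = rng.normal(size=(2, 5, C))
ones, zeros = np.ones(C), np.zeros(C)

# 1. Output shape matches input.
y = layer_norm(x, ones, zeros)
assert y.shape == x.shape

# 2. Diagnostic: mean 0 and std 1 along the last axis.
assert np.allclose(y.mean(axis=-1), 0.0, atol=1e-6)
assert np.allclose(y.std(axis=-1), 1.0, atol=1e-3)

# 3. gamma scales, beta shifts.
gamma, beta = rng.normal(size=C), rng.normal(size=C)
assert np.allclose(layer_norm(x, gamma, beta), gamma * y + beta)

# 4. Constant input gives beta.
const_out = layer_norm(np.full((2, 4, C), 7.0), gamma, beta)
assert np.allclose(const_out, beta)

# 5. Large-magnitude inputs stay finite (1e8 is an assumed scale).
y_big = layer_norm(1e8 * x, ones, zeros)
assert np.isfinite(y_big).all()
```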