BatchNorm forward (train + eval modes)
Background
BatchNorm (Ioffe & Szegedy, 2015) was among the first widely adopted normalisation layers for deep networks. It stabilised training by keeping each feature's activations at roughly zero mean and unit variance. It normalises across the batch axis: every feature channel gets its own mean and variance computed over all examples in the batch. It has two distinct behaviours: a train mode that uses the current batch's statistics (and updates a running average for later), and an eval mode that uses those saved running statistics so inference is deterministic.
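To make "normalises across the batch axis" concrete, here is a small NumPy sketch. The toy array and variable names are illustrative assumptions, not part of the problem's test data.

```python
import numpy as np

# Hypothetical toy batch: B=3 examples (rows), C=2 feature channels (columns).
x = np.array([[1.0, 10.0],
              [2.0, 20.0],
              [3.0, 30.0]])

# Reducing over axis=0 collapses the batch and leaves one statistic per feature.
mean = x.mean(axis=0)   # shape (C,)
var = x.var(axis=0)     # shape (C,), biased variance (divides by B)

# Broadcasting applies each feature's statistics to every row.
x_hat = (x - mean) / np.sqrt(var)

print(mean.shape, var.shape)   # both have one entry per feature
print(x_hat.mean(axis=0))      # close to 0 for each feature
print(x_hat.std(axis=0))       # close to 1 for each feature
```

Each column is normalised independently, so features on very different scales (here 1 to 3 versus 10 to 30) end up with the same normalised values.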
Problem statement
Implement batch_norm_forward(x, gamma, beta, running_mean, running_var, momentum, eps, training) for a (B, C) input, normalising along the batch axis (axis=0) per feature.
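Train mode (training=True): normalise with batch statistics and update the running buffers in place:
$$\mu_B = \frac{1}{B}\sum_{i=1}^{B} x_i, \qquad \sigma_B^2 = \frac{1}{B}\sum_{i=1}^{B} (x_i - \mu_B)^2$$
$$\hat{x}_i = \frac{x_i - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}}, \qquad y_i = \gamma\,\hat{x}_i + \beta$$
with running buffers updated by an exponential moving average that puts weight momentum on the new batch:
$$\text{running\_mean} \leftarrow (1 - \text{momentum})\,\text{running\_mean} + \text{momentum}\,\mu_B$$
$$\text{running\_var} \leftarrow (1 - \text{momentum})\,\text{running\_var} + \text{momentum}\,\sigma_B^2$$
Eval mode (training=False): normalise with the saved running statistics and do not update them:
$$\hat{x}_i = \frac{x_i - \text{running\_mean}}{\sqrt{\text{running\_var} + \epsilon}}, \qquad y_i = \gamma\,\hat{x}_i + \beta$$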
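The two modes above can be sketched in NumPy as follows. This is one possible reading of the specification, not an official reference solution. It assumes the running buffers are float arrays, so the in-place updates do not hit an integer casting error.

```python
import numpy as np

def batch_norm_forward(x, gamma, beta, running_mean, running_var,
                       momentum, eps, training):
    """Sketch of a BatchNorm forward pass for a (B, C) input."""
    if training:
        # Per-feature batch statistics over the batch axis.
        mean = x.mean(axis=0)
        var = x.var(axis=0)  # biased: divides by B
        # EMA update written with in-place operators so the caller's
        # buffers change (assumes float buffers).
        running_mean *= (1.0 - momentum)
        running_mean += momentum * mean
        running_var *= (1.0 - momentum)
        running_var += momentum * var
    else:
        # Eval: read the saved statistics, never write to them.
        mean, var = running_mean, running_var
    x_hat = (x - mean) / np.sqrt(var + eps)
    return gamma * x_hat + beta

# Usage with the inputs of Example 1 (train mode).
rm, rv = np.array([0.0]), np.array([1.0])
out = batch_norm_forward(np.array([[0.0], [4.0]]),
                         np.array([1.0]), np.array([0.0]),
                         rm, rv, momentum=0.1, eps=0.0, training=True)
print(out.ravel())  # normalised batch
print(rm, rv)       # buffers after one EMA step
```

Splitting the update into `*=` followed by `+=` is one way to keep the mutation in place; an assignment such as `running_mean = ...` would only rebind the local name.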
Input
- x: np.ndarray of shape (B, C), with B examples and C feature channels.
- gamma: (C,), the learned per-feature scale $\gamma$.
- beta: (C,), the learned per-feature shift $\beta$.
- running_mean: (C,), the running mean buffer. Mutated in place when training=True.
- running_var: (C,), the running variance buffer. Mutated in place when training=True.
- momentum: float between 0 and 1, the EMA weight on the new batch stats (PyTorch default 0.1).
- eps: float, numerical-stability constant added inside the square root.
- training: bool, True for train mode, False for eval mode.
Output
Returns an np.ndarray of the same shape as x. In train mode it also mutates running_mean and running_var in place; in eval mode those buffers are left untouched.
Examples
Example 1 — train mode (B=2, C=1)
Input: x=[[0.0], [4.0]], gamma=[1.0], beta=[0.0],
running_mean=[0.0], running_var=[1.0],
momentum=0.1, eps=0, training=True
Output: out=[[-1.0], [1.0]]
running_mean -> [0.2], running_var -> [1.3]
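Explanation: batch stats along axis=0 are $\mu_B = 2$ and $\sigma_B^2 = 4$ (biased, divided by $B = 2$). So $\hat{x} = (x - 2)/\sqrt{4 + 0} = (x - 2)/2$, giving out $= [[-1], [1]]$. The running buffers then move 10% toward the batch: $0.9 \cdot 0 + 0.1 \cdot 2 = 0.2$ and $0.9 \cdot 1 + 0.1 \cdot 4 = 1.3$.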
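The arithmetic in Example 1 can be checked step by step. The snippet below is a plain recomputation with NumPy; the variable names are illustrative.

```python
import numpy as np

x = np.array([[0.0], [4.0]])
momentum = 0.1

mu = x.mean(axis=0)    # batch mean of 0 and 4
var = x.var(axis=0)    # biased variance: ((0-2)^2 + (4-2)^2) / 2

out = (x - mu) / np.sqrt(var + 0.0)  # eps = 0 in this example

# EMA step from the initial buffers running_mean=0, running_var=1.
new_running_mean = (1 - momentum) * np.array([0.0]) + momentum * mu
new_running_var = (1 - momentum) * np.array([1.0]) + momentum * var

print(mu, var)                            # 2 and 4
print(out.ravel())                        # -1 and 1
print(new_running_mean, new_running_var)  # about 0.2 and 1.3
```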
Example 2 — eval mode (B=2, C=2)
Input: x=[[10.0, 20.0], [30.0, 40.0]], gamma=[1.0, 1.0], beta=[0.0, 0.0],
running_mean=[5.0, 5.0], running_var=[4.0, 4.0],
momentum=0.1, eps=1e-9, training=False
Output: [[2.5, 7.5], [12.5, 17.5]] (running_mean / running_var unchanged)
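Explanation: eval ignores the batch statistics and uses the saved stats: $\hat{x} = (x - 5)/\sqrt{4 + 10^{-9}} \approx (x - 5)/2$, so for instance $(10 - 5)/2 = 2.5$ and $(40 - 5)/2 = 17.5$. The running buffers are not modified.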
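A direct recomputation of Example 2 shows that the batch's own mean never enters the eval-mode result. This is a sketch with illustrative variable names.

```python
import numpy as np

x = np.array([[10.0, 20.0], [30.0, 40.0]])
running_mean = np.array([5.0, 5.0])
running_var = np.array([4.0, 4.0])
eps = 1e-9

# Eval mode: normalise with the saved statistics only.
out = (x - running_mean) / np.sqrt(running_var + eps)

# For contrast, the batch mean is computed but never used.
batch_mean = x.mean(axis=0)

print(out)         # close to [[2.5, 7.5], [12.5, 17.5]]
print(batch_mean)  # [20, 30], very different from running_mean
```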
Constraints
- x is (B, C); statistics are computed along axis=0 (the batch axis), so each of the C features has its own $\mu_B$ and $\sigma_B^2$.
- Use the biased variance (divide by $B$, not $B - 1$) so it matches the normalisation denominator; this is np.var's default.
- Train mode mutates running_mean and running_var in place (*=, +=); eval mode must leave them exactly unchanged.
- momentum weights the new batch stats: $\text{running} \leftarrow (1 - \text{momentum}) \cdot \text{running} + \text{momentum} \cdot \text{batch}$.
- $\epsilon$ is added inside the square root: $\sqrt{\sigma^2 + \epsilon}$.
- Tests compare with tolerances from atol=1e-6 (means) to atol=1e-3 (std).
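Two of these constraints are easy to get wrong: the in-place update and the biased variance. The sketch below illustrates both. The helper names update_rebind and update_in_place are hypothetical and exist only for this demonstration.

```python
import numpy as np

def update_rebind(buf, batch_stat, momentum):
    # Rebinds the local name to a NEW array; the caller's buffer is untouched.
    buf = (1 - momentum) * buf + momentum * batch_stat
    return buf

def update_in_place(buf, batch_stat, momentum):
    # Mutates the array the caller passed in.
    buf *= (1 - momentum)
    buf += momentum * batch_stat

a = np.array([0.0])
update_rebind(a, np.array([2.0]), 0.1)
print(a)  # unchanged: the update was lost

b = np.array([0.0])
update_in_place(b, np.array([2.0]), 0.1)
print(b)  # moved 10% toward the batch statistic

# Biased vs unbiased variance on the Example 1 batch.
x = np.array([[0.0], [4.0]])
biased = x.var(axis=0)             # ddof=0 is np.var's default: divide by B
unbiased = x.var(axis=0, ddof=1)   # divide by B - 1
print(biased, unbiased)            # 4 vs 8
```

With B=2 the two variance estimates differ by a factor of two, so mixing them up changes both the output and the running buffer.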
Notes
- Deterministic inference. Eval normalises with the saved running stats, never the eval batch, so a single example yields the same output regardless of what else shares its batch.
- Why BatchNorm breaks for autoregressive LMs. At generation you produce one token at a time, so the "eval batch" is size 1 and batch statistics are degenerate. That is why GPT-2 and friends use LayerNorm instead: it has no batch axis and normalises along axis=-1 per token.
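Both notes can be illustrated with a short experiment. The three helpers below (eval_norm, batch_stat_norm, layer_norm) are hypothetical, simplified normalisers without gamma and beta, and the batches are made-up data.

```python
import numpy as np

running_mean = np.array([5.0, 5.0])
running_var = np.array([4.0, 4.0])
eps = 1e-5

def eval_norm(x):
    # BatchNorm eval mode: saved statistics only.
    return (x - running_mean) / np.sqrt(running_var + eps)

def batch_stat_norm(x):
    # BatchNorm with batch statistics (train-mode normalisation).
    return (x - x.mean(axis=0)) / np.sqrt(x.var(axis=0) + eps)

def layer_norm(x):
    # LayerNorm: statistics over the feature axis, one pair per example.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

row = np.array([[10.0, 20.0]])
batch_a = np.vstack([row, [[30.0, 40.0]]])   # same first row,
batch_b = np.vstack([row, [[-50.0, 0.0]]])   # different batch mates

same_eval = np.allclose(eval_norm(batch_a)[0], eval_norm(batch_b)[0])
same_train = np.allclose(batch_stat_norm(batch_a)[0], batch_stat_norm(batch_b)[0])
same_ln = np.allclose(layer_norm(batch_a)[0], layer_norm(batch_b)[0])
degenerate = batch_stat_norm(row)  # batch of one: every feature minus itself

print(same_eval, same_train, same_ln)
print(degenerate)
```

In this sketch the first row's output is independent of its batch mates under eval-mode BatchNorm and under LayerNorm, but not when batch statistics are used. With a batch of one, the batch-statistics output collapses to zeros because each feature's mean equals its only value.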
This problem ships 5 hidden tests. They run in your browser via Pyodide, with no backend and no submission queue.
- Train mode: output shape matches input
- Diagnostic: train mode normalises each feature to mean ~0, std ~1
- Train mode mutates running_mean and running_var
- Eval mode does NOT mutate running stats and uses them for normalisation
- gamma scales, beta shifts (same as LayerNorm)