Transformer block forward (pre-LN, residual)
Background
This is the GPT-series capstone: the canonical pre-LN transformer block that GPT-2 small stacks twelve times. It wires together the components built earlier, LayerNorm (build-gpt-07) and multi-head attention (build-gpt-06, which itself builds on attention/mask/head-split), plus a position-wise MLP, into two residual sub-layers. Two design choices make it work: residual connections (each sub-layer's output is added back to its input) and pre-LN (LayerNorm sits inside each residual branch, on the sub-layer input).
Problem statement
Implement transformer_block(x, params, n_head, mask=None) as two sub-layers, attention first then MLP, each wrapped in pre-LN + a residual add:
h = x + MHA(LN1(x))
out = h + MLP(LN2(h))
The MLP is gelu(x @ W_mlp1) @ W_mlp2 with a 4C-wide hidden layer. All helpers (layer_norm, multi_head_attention, mlp, …) are provided in the starter; your job is to wire the two pre-LN → sub-layer → residual stages with parameters drawn from params.
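To make the MLP formula concrete, here is a minimal NumPy sketch. The starter's own gelu and mlp may differ in detail; the tanh approximation of GELU used here is an assumption (it is the form GPT-2 uses), and the test shapes are arbitrary.

```python
import numpy as np

def gelu(x):
    # Tanh approximation of GELU (assumed; the starter may use another form)
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def mlp(x, W_mlp1, W_mlp2):
    # (B, T, C) @ (C, 4C) -> (B, T, 4C), GELU, then @ (4C, C) -> (B, T, C)
    return gelu(x @ W_mlp1) @ W_mlp2

rng = np.random.default_rng(0)
B, T, C = 2, 3, 8
x = rng.standard_normal((B, T, C))
W1 = rng.standard_normal((C, 4 * C)) * 0.02
W2 = rng.standard_normal((4 * C, C)) * 0.02
y = mlp(x, W1, W2)
print(y.shape)  # (2, 3, 8)
```

The matmul acts on the last axis only, so the same weights are applied at every position independently, which is what "position-wise" means.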
Input
- x: np.ndarray of shape (B, T, C), the input sequence batch.
- params: dict with:
  - gamma1 / beta1, shape (C,): pre-attention LN
  - W_qkv, shape (C, 3C), and W_o, shape (C, C): attention
  - gamma2 / beta2, shape (C,): pre-MLP LN
  - W_mlp1, shape (C, 4C), and W_mlp2, shape (4C, C): MLP
- n_head: int, number of attention heads (divides C).
- mask: optional (T, T) bool array, typically the causal mask.
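For local experiments, inputs with the shapes listed above could be built as follows. This is only a sketch: make_params is a hypothetical helper, the 0.02 initialisation scale is arbitrary, and the mask convention (True means the position may be attended to) is an assumption to check against the starter's multi_head_attention.

```python
import numpy as np

def make_params(C, seed=0, scale=0.02):
    """Hypothetical helper: random block parameters with the documented shapes."""
    rng = np.random.default_rng(seed)
    return {
        "gamma1": np.ones(C), "beta1": np.zeros(C),         # pre-attention LN
        "W_qkv": rng.standard_normal((C, 3 * C)) * scale,   # Q, K, V projections in one matrix
        "W_o": rng.standard_normal((C, C)) * scale,         # attention output projection
        "gamma2": np.ones(C), "beta2": np.zeros(C),         # pre-MLP LN
        "W_mlp1": rng.standard_normal((C, 4 * C)) * scale,  # MLP expand: C -> 4C
        "W_mlp2": rng.standard_normal((4 * C, C)) * scale,  # MLP contract: 4C -> C
    }

B, T, C, n_head = 2, 5, 8, 2
assert C % n_head == 0  # n_head must divide C

x = np.random.default_rng(1).standard_normal((B, T, C))
params = make_params(C)

# Causal mask. Assumed convention: True = "query row may attend to key column".
mask = np.tril(np.ones((T, T), dtype=bool))

for name, value in params.items():
    print(name, value.shape)
```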
Output
Returns an np.ndarray of shape (B, T, C) — same shape as x (so blocks stack).
Examples
Example 1 — the two-residual structure
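def transformer_block(x, params, n_head, mask=None):
    p = params  # keys: gamma1, beta1, W_qkv, W_o, gamma2, beta2, W_mlp1, W_mlp2
    # Sub-layer 1: pre-LN -> attention -> residual add
    h = x + multi_head_attention(layer_norm(x, p["gamma1"], p["beta1"]), p["W_qkv"], p["W_o"], n_head, mask)
    # Sub-layer 2: pre-LN -> MLP -> residual add (on h, not x)
    return h + mlp(layer_norm(h, p["gamma2"], p["beta2"]), p["W_mlp1"], p["W_mlp2"])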
Explanation: LayerNorm is applied to each sub-layer's input (pre-LN); the sub-layer output is then added back to its input. Attention comes before the MLP, and the second residual builds on h (the post-attention stream), not on x.
Example 2 — zeroed projections make the block the identity
Input: params with W_qkv = W_o = W_mlp1 = W_mlp2 = 0 (any LN params), any x
Output: transformer_block(x, params, n_head) == x (to numerical precision)
Explanation: zero projection matrices make both multi_head_attention(...) and mlp(...) output zeros, so the two residual sums give x + 0 + 0 = x. This identity holds only if both residual adds are present: a missing residual breaks it immediately, which is exactly what the diagnostic test checks.
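The identity can be checked end to end with stand-in helpers. The sketch below is self-contained, so layer_norm, multi_head_attention and mlp here are assumed implementations rather than the starter's; in particular eps=1e-5, the tanh GELU, and the mask convention (True = may attend) are assumptions.

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return gamma * (x - mu) / np.sqrt(var + eps) + beta

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)  # subtract the row max for stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def gelu(x):
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def mlp(x, W_mlp1, W_mlp2):
    return gelu(x @ W_mlp1) @ W_mlp2

def multi_head_attention(x, W_qkv, W_o, n_head, mask=None):
    B, T, C = x.shape
    hd = C // n_head
    q, k, v = np.split(x @ W_qkv, 3, axis=-1)  # each (B, T, C)

    def split_heads(t):
        return t.reshape(B, T, n_head, hd).transpose(0, 2, 1, 3)  # (B, H, T, hd)

    q, k, v = split_heads(q), split_heads(k), split_heads(v)
    scores = q @ k.transpose(0, 1, 3, 2) / np.sqrt(hd)  # (B, H, T, T)
    if mask is not None:
        scores = np.where(mask, scores, -1e9)  # assumed: True = may attend
    out = softmax(scores) @ v  # (B, H, T, hd)
    return out.transpose(0, 2, 1, 3).reshape(B, T, C) @ W_o

def transformer_block(x, p, n_head, mask=None):
    h = x + multi_head_attention(layer_norm(x, p["gamma1"], p["beta1"]),
                                 p["W_qkv"], p["W_o"], n_head, mask)
    return h + mlp(layer_norm(h, p["gamma2"], p["beta2"]), p["W_mlp1"], p["W_mlp2"])

B, T, C, n_head = 2, 4, 8, 2
rng = np.random.default_rng(0)
x = rng.standard_normal((B, T, C))
zero_params = {
    "gamma1": rng.standard_normal(C), "beta1": rng.standard_normal(C),  # any LN params
    "W_qkv": np.zeros((C, 3 * C)), "W_o": np.zeros((C, C)),
    "gamma2": rng.standard_normal(C), "beta2": rng.standard_normal(C),
    "W_mlp1": np.zeros((C, 4 * C)), "W_mlp2": np.zeros((4 * C, C)),
}
out = transformer_block(x, zero_params, n_head)
print(np.allclose(out, x))  # True
```

With the second residual dropped (returning only mlp(...)), the same inputs would produce all zeros instead of x, which is how the diagnostic separates a missing residual from a correct block.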
Constraints
- Pre-LN: normalise the sub-layer input (LN inside the residual), not the residual output. Post-LN (LN(x + sublayer(x))) is a different, less stable design.
- Order is attention then MLP; the MLP sub-layer operates on h, the output of the attention residual.
- Both residual sums are required: the zeroed-weights identity test fails if either is missing.
- The mask (if given) is passed straight through to the attention sub-layer.
- Output shape equals input shape (B, T, C); tests compare against a reference with atol ≈ 1e-6 and require determinism.
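The difference between the two placements named in the first constraint can be sketched with a generic sub-layer. The helper names pre_ln and post_ln are hypothetical, and a sub-layer that returns zeros stands in for zeroed projection weights.

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return gamma * (x - mu) / np.sqrt(var + eps) + beta

def pre_ln(x, sublayer, gamma, beta):
    # LN on the sub-layer input; the skip path carries x untouched
    return x + sublayer(layer_norm(x, gamma, beta))

def post_ln(x, sublayer, gamma, beta):
    # LN on the residual output; the skip path is renormalised
    return layer_norm(x + sublayer(x), gamma, beta)

def zero_sublayer(h):
    # Stand-in for a sub-layer whose projection weights are all zero
    return np.zeros_like(h)

rng = np.random.default_rng(0)
x = rng.standard_normal((2, 4, 8)) * 3.0 + 1.0
gamma, beta = np.ones(8), np.zeros(8)

print(np.allclose(pre_ln(x, zero_sublayer, gamma, beta), x))   # True
print(np.allclose(post_ln(x, zero_sublayer, gamma, beta), x))  # False
```

Even with a sub-layer that contributes nothing, the post-LN variant rescales x, so only the pre-LN wiring passes the zeroed-weights identity test.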
Notes
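- Why pre-LN. With LN inside the residual branch, the skip path is an exact identity, so gradients flow back through it unchanged, which is what makes very deep stacks trainable. Post-LN puts LayerNorm on the skip path itself, so every gradient passes through a normalisation at each layer, which tends to destabilise deep models. GPT-2, GPT-3, and Llama all normalise before the sub-layer (Llama uses RMSNorm in place of LayerNorm).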
- You've built GPT-2. Stack twelve of these, add token + positional embeddings on the input and a final LayerNorm + unembedding + softmax on the output, and you have the GPT-2 small architecture.
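The stacking described in the last note might look like the sketch below. Everything here is illustrative: gpt_forward, the model dict keys (wte, wpe, gamma_f, beta_f, W_unembed) and toy_block are hypothetical names, the sizes are tiny, and toy_block is only a stand-in so the snippet runs on its own; pass your transformer_block as block_fn instead.

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return gamma * (x - mu) / np.sqrt(var + eps) + beta

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def gpt_forward(tokens, model, block_fn, n_head):
    """Hypothetical assembly: embeddings -> N blocks -> final LN -> unembedding -> softmax."""
    B, T = tokens.shape
    x = model["wte"][tokens] + model["wpe"][:T]            # (B, T, C)
    mask = np.tril(np.ones((T, T), dtype=bool))            # causal; True = may attend (assumed)
    for params in model["blocks"]:
        x = block_fn(x, params, n_head, mask)              # shape stays (B, T, C)
    x = layer_norm(x, model["gamma_f"], model["beta_f"])   # GPT-2 applies a final LayerNorm
    return softmax(x @ model["W_unembed"])                 # (B, T, V) next-token probabilities

def toy_block(x, params, n_head, mask=None):
    # Stand-in with the same signature and shape contract as transformer_block
    return x + np.tanh(x @ params["W"])

rng = np.random.default_rng(0)
V, C, T_max, n_layer = 50, 16, 10, 12
model = {
    "wte": rng.standard_normal((V, C)) * 0.02,        # token embeddings
    "wpe": rng.standard_normal((T_max, C)) * 0.02,    # positional embeddings
    "blocks": [{"W": rng.standard_normal((C, C)) * 0.02} for _ in range(n_layer)],
    "gamma_f": np.ones(C), "beta_f": np.zeros(C),
    "W_unembed": rng.standard_normal((C, V)) * 0.02,
}
tokens = rng.integers(0, V, size=(2, 6))
probs = gpt_forward(tokens, model, toy_block, n_head=4)
print(probs.shape)  # (2, 6, 50)
```

The loop only works because each block returns the same (B, T, C) shape it receives, which is the reason the output-shape requirement exists.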
This problem ships 5 hidden tests. They run in your browser via Pyodide — no backend, no submission queue. Press ▶ Run tests to execute.
- Output shape is (B, T, C)
- Matches reference: pre-LN with two residual sums
- Diagnostic: zeroed projection weights -> block is identity (residual saves us)
- Causal mask is honoured (output deterministic given fixed input + mask)
- Determinism: same input + params -> same output