Transformer block forward (pre-LN, residual) [Hard]

Background

This is the GPT-series capstone: the canonical pre-LN transformer block that GPT-2 small stacks twelve times. It wires together the components built earlier, LayerNorm (build-gpt-07) and multi-head attention (build-gpt-06, which itself builds on attention/mask/head-split), plus a position-wise MLP, into two residual sub-layers. Two design choices make it work: residual connections (each sub-layer's output is added to its input) and pre-LN (LayerNorm sits inside each residual branch, on the sub-layer input).

Problem statement

Implement transformer_block(x, params, n_head, mask=None) as two sub-layers, attention first then MLP, each wrapped in pre-LN + a residual add:

$$h = x + \text{MHA}\big(\text{LN}_1(x)\big), \qquad \text{out} = h + \text{MLP}\big(\text{LN}_2(h)\big)$$

The MLP is gelu(x @ W_mlp1) @ W_mlp2 with a 4C-wide hidden layer. All helpers (layer_norm, multi_head_attention, mlp, …) are provided in the starter; your job is to wire the two pre-LN → sub-layer → residual stages with parameters drawn from params.
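To make the shapes in the MLP concrete, here is a minimal sketch. It is not the starter's code: the tanh approximation of GELU and the 0.02 weight scale are assumptions chosen for illustration, and the starter's mlp helper may differ in detail.

```python
import numpy as np

def gelu(x):
    # tanh approximation of GELU (an assumption; the starter may use the exact erf form)
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def mlp(x, W_mlp1, W_mlp2):
    # Expand C -> 4C, apply GELU elementwise, project 4C -> C.
    return gelu(x @ W_mlp1) @ W_mlp2

rng = np.random.default_rng(0)
B, T, C = 2, 5, 8
x = rng.standard_normal((B, T, C))
W_mlp1 = 0.02 * rng.standard_normal((C, 4 * C))
W_mlp2 = 0.02 * rng.standard_normal((4 * C, C))

hidden = gelu(x @ W_mlp1)
y = mlp(x, W_mlp1, W_mlp2)
print(hidden.shape, y.shape)  # (2, 5, 32) (2, 5, 8)

# Position-wise: each time step is transformed independently of the others.
print(np.allclose(mlp(x[:, 2:3], W_mlp1, W_mlp2), y[:, 2:3]))  # True
```

Because the matrix products act only on the last axis, the same weights are applied at every (batch, position) pair, which is what "position-wise" means here.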

Input

  • x (np.ndarray of shape (B, T, C)): the input sequence batch.
  • params (dict) with: gamma1/beta1 (C,) (pre-attention LN), W_qkv (C, 3C) + W_o (C, C) (attention), gamma2/beta2 (C,) (pre-MLP LN), W_mlp1 (C, 4C) + W_mlp2 (4C, C) (MLP).
  • n_head (int): number of attention heads (divides C).
  • mask (optional (T, T) bool array): typically the causal mask.
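For local experiments, inputs of the right shapes can be built as in the sketch below. The helper name make_params, the 0.02 initialisation scale, and the mask convention (True means "may attend") are all assumptions for illustration; the starter and the hidden tests may construct their inputs differently.

```python
import numpy as np

def make_params(C, rng, scale=0.02):
    # Hypothetical helper: one dict holding every tensor the block needs.
    return {
        "gamma1": np.ones(C), "beta1": np.zeros(C),            # pre-attention LN
        "W_qkv": scale * rng.standard_normal((C, 3 * C)),      # fused Q/K/V projection
        "W_o": scale * rng.standard_normal((C, C)),            # attention output projection
        "gamma2": np.ones(C), "beta2": np.zeros(C),            # pre-MLP LN
        "W_mlp1": scale * rng.standard_normal((C, 4 * C)),     # MLP expand
        "W_mlp2": scale * rng.standard_normal((4 * C, C)),     # MLP project back
    }

B, T, C, n_head = 2, 4, 8, 2
assert C % n_head == 0, "n_head must divide C"

rng = np.random.default_rng(0)
x = rng.standard_normal((B, T, C))
params = make_params(C, rng)

# Causal mask, assuming True = "query row i may attend to key column j".
mask = np.tril(np.ones((T, T), dtype=bool))

for name, value in params.items():
    print(name, value.shape)
```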

Output

Returns an np.ndarray of shape (B, T, C) — same shape as x (so blocks stack).

Examples

Example 1 — the two-residual structure

h   = x + multi_head_attention(layer_norm(x, gamma1, beta1), W_qkv, W_o, n_head, mask)
out = h + mlp(layer_norm(h, gamma2, beta2), W_mlp1, W_mlp2)
return out

Explanation: LayerNorm is applied to each sub-layer's input (pre-LN); the sub-layer output is then added back to its input. Attention comes before the MLP, and the second residual builds on h (the post-attention stream), not on x.
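The two lines above can be turned into a runnable end-to-end sketch. The helper bodies below are stand-ins written for this example, not the starter's implementations: they assume eps = 1e-5 in LayerNorm, the tanh GELU approximation, a W_qkv layout of [Q | K | V] along the last axis, and a mask in which True means "may attend". Only the call signatures are taken from Example 1.

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    # Normalise over the channel axis, then scale and shift.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return gamma * (x - mean) / np.sqrt(var + eps) + beta

def gelu(x):
    # tanh approximation (assumed)
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def mlp(x, W_mlp1, W_mlp2):
    return gelu(x @ W_mlp1) @ W_mlp2

def multi_head_attention(x, W_qkv, W_o, n_head, mask=None):
    B, T, C = x.shape
    hd = C // n_head
    q, k, v = np.split(x @ W_qkv, 3, axis=-1)            # three (B, T, C) arrays
    def heads(t):                                         # (B, T, C) -> (B, n_head, T, hd)
        return t.reshape(B, T, n_head, hd).transpose(0, 2, 1, 3)
    q, k, v = heads(q), heads(k), heads(v)
    scores = q @ k.transpose(0, 1, 3, 2) / np.sqrt(hd)    # (B, n_head, T, T)
    if mask is not None:
        scores = np.where(mask, scores, -1e9)             # assumes True = may attend
    scores = scores - scores.max(axis=-1, keepdims=True)  # stable softmax
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    merged = (weights @ v).transpose(0, 2, 1, 3).reshape(B, T, C)
    return merged @ W_o

def transformer_block(x, params, n_head, mask=None):
    p = params
    # Sub-layer 1: pre-LN -> attention -> residual add onto x.
    h = x + multi_head_attention(layer_norm(x, p["gamma1"], p["beta1"]),
                                 p["W_qkv"], p["W_o"], n_head, mask)
    # Sub-layer 2: pre-LN -> MLP -> residual add onto h (not x).
    return h + mlp(layer_norm(h, p["gamma2"], p["beta2"]), p["W_mlp1"], p["W_mlp2"])

rng = np.random.default_rng(0)
B, T, C, n_head = 2, 4, 8, 2
params = {
    "gamma1": np.ones(C), "beta1": np.zeros(C),
    "W_qkv": 0.02 * rng.standard_normal((C, 3 * C)),
    "W_o": 0.02 * rng.standard_normal((C, C)),
    "gamma2": np.ones(C), "beta2": np.zeros(C),
    "W_mlp1": 0.02 * rng.standard_normal((C, 4 * C)),
    "W_mlp2": 0.02 * rng.standard_normal((4 * C, C)),
}
x = rng.standard_normal((B, T, C))
causal = np.tril(np.ones((T, T), dtype=bool))
out = transformer_block(x, params, n_head, causal)
print(out.shape)  # (2, 4, 8)
```

Since the output has the same shape as the input, the result can be fed straight into another call to transformer_block with a different params dict.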

Example 2 — zeroed projections make the block the identity

Input:  params with W_qkv = W_o = W_mlp1 = W_mlp2 = 0 (any LN params), any x
Output: transformer_block(x, params, n_head) == x   (to numerical precision)

Explanation: zero projection matrices make both multi_head_attention(...) and mlp(...) output zeros, so the two residual sums give x + 0 + 0 = x. This identity holds only if both residual adds are present; a missing residual breaks it immediately, which is exactly what the diagnostic test checks.
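The identity can be checked locally with a sketch like the following. The helpers are compact stand-ins under the same assumptions as any local reimplementation (eps = 1e-5, tanh GELU, [Q | K | V] layout), and block_missing_mlp_residual is a deliberately broken, hypothetical variant included only to show how the check fails.

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return gamma * (x - mean) / np.sqrt(var + eps) + beta

def gelu(x):
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def mlp(x, W1, W2):
    return gelu(x @ W1) @ W2

def multi_head_attention(x, W_qkv, W_o, n_head, mask=None):
    B, T, C = x.shape
    hd = C // n_head
    q, k, v = (t.reshape(B, T, n_head, hd).transpose(0, 2, 1, 3)
               for t in np.split(x @ W_qkv, 3, axis=-1))
    s = q @ k.transpose(0, 1, 3, 2) / np.sqrt(hd)
    if mask is not None:
        s = np.where(mask, s, -1e9)
    w = np.exp(s - s.max(axis=-1, keepdims=True))
    w = w / w.sum(axis=-1, keepdims=True)
    return (w @ v).transpose(0, 2, 1, 3).reshape(B, T, C) @ W_o

def transformer_block(x, p, n_head, mask=None):
    h = x + multi_head_attention(layer_norm(x, p["gamma1"], p["beta1"]),
                                 p["W_qkv"], p["W_o"], n_head, mask)
    return h + mlp(layer_norm(h, p["gamma2"], p["beta2"]), p["W_mlp1"], p["W_mlp2"])

def block_missing_mlp_residual(x, p, n_head, mask=None):
    # Deliberately broken: the second residual add is dropped.
    h = x + multi_head_attention(layer_norm(x, p["gamma1"], p["beta1"]),
                                 p["W_qkv"], p["W_o"], n_head, mask)
    return mlp(layer_norm(h, p["gamma2"], p["beta2"]), p["W_mlp1"], p["W_mlp2"])

rng = np.random.default_rng(0)
B, T, C, n_head = 2, 4, 8, 2
zeroed = {
    # LN parameters are arbitrary; the projections are all zero.
    "gamma1": rng.standard_normal(C), "beta1": rng.standard_normal(C),
    "gamma2": rng.standard_normal(C), "beta2": rng.standard_normal(C),
    "W_qkv": np.zeros((C, 3 * C)), "W_o": np.zeros((C, C)),
    "W_mlp1": np.zeros((C, 4 * C)), "W_mlp2": np.zeros((4 * C, C)),
}
x = rng.standard_normal((B, T, C))

print(np.allclose(transformer_block(x, zeroed, n_head), x))           # True
print(np.allclose(block_missing_mlp_residual(x, zeroed, n_head), x))  # False
```

The broken variant returns all zeros, because with the residual gone nothing carries x through to the output.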

Constraints

  • Pre-LN: normalise the sub-layer input (LN inside the residual), not the residual output. Post-LN (LN(x + sublayer(x))) is a different, less stable design.
  • Order is attention then MLP; the MLP sub-layer operates on h, the output of the attention residual.
  • Both residual sums are required — the zeroed-weights identity test fails if either is missing.
  • The mask (if given) is passed straight through to the attention sub-layer.
  • Output shape equals input shape (B, T, C); tests compare against a reference with atol≈1e-6 and require determinism.
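The difference between the two wirings in the first constraint can be seen with a toy sub-layer. In the sketch below, pre_ln and post_ln are hypothetical helper names and the sub-layer is a stand-in that returns zeros; the point is only where LayerNorm sits relative to the residual add.

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return gamma * (x - mean) / np.sqrt(var + eps) + beta

def pre_ln(x, sublayer, gamma, beta):
    # LN inside the residual branch: the skip path carries x untouched.
    return x + sublayer(layer_norm(x, gamma, beta))

def post_ln(x, sublayer, gamma, beta):
    # LN applied to the residual output: the skip path itself is normalised.
    return layer_norm(x + sublayer(x), gamma, beta)

def zero_sublayer(z):
    # Stand-in for attention or the MLP with zeroed weights.
    return np.zeros_like(z)

rng = np.random.default_rng(0)
x = 3.0 * rng.standard_normal((2, 4, 8)) + 1.0
gamma, beta = np.ones(8), np.zeros(8)

pre = pre_ln(x, zero_sublayer, gamma, beta)
post = post_ln(x, zero_sublayer, gamma, beta)

print(np.allclose(pre, x))   # True: pre-LN passes x through unchanged
print(np.allclose(post, x))  # False: post-LN returns LN(x), not x
```

This is why the zeroed-weights identity test only makes sense for the pre-LN design: under post-LN the output is renormalised even when the sub-layer contributes nothing.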

Notes

  • Why pre-LN. With LN inside the residual branch, the skip path is an exact identity, so gradients flow back through it unchanged, which is what makes very deep stacks trainable. Post-LN puts LayerNorm on the skip path itself, so every gradient has to pass through it, which tends to destabilise deep models. GPT-2 and GPT-3 use pre-LN; Llama uses the same pre-norm arrangement with RMSNorm in place of LayerNorm.
  • You've built GPT-2. Stack twelve of these, add token + positional embeddings on the input and a final LayerNorm + unembedding + softmax on the output, and you have GPT-2 small.
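The gradient argument in the first note can be written out for the attention sub-layer. This is a sketch of the reasoning rather than a full derivation; J_LN denotes the Jacobian of LayerNorm and h' the post-LN output, both notation introduced here for illustration.

```latex
% Pre-LN: the skip path contributes an exact identity term.
\frac{\partial h}{\partial x}
  = I + \frac{\partial\, \text{MHA}\big(\text{LN}_1(x)\big)}{\partial x}

% Post-LN, h' = \text{LN}\big(x + \text{MHA}(x)\big): every path is multiplied by J_{\text{LN}}.
\frac{\partial h'}{\partial x}
  = J_{\text{LN}}\big(x + \text{MHA}(x)\big)
    \left( I + \frac{\partial\, \text{MHA}(x)}{\partial x} \right)
```

In the pre-LN case the identity term survives however many sub-layers are stacked, since a product of factors of the form (I + J) always contains I when expanded. In the post-LN case no such bare identity term remains.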

This problem ships 5 hidden tests. They run in your browser via Pyodide, with no backend and no submission queue.

  • Output shape is (B, T, C)
  • Matches reference: pre-LN with two residual sums
  • Diagnostic: zeroed projection weights -> block is identity (residual saves us)
  • Causal mask is honoured (output deterministic given fixed input + mask)
  • Determinism: same input + params -> same output
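The hidden tests themselves are not shown, but the mask and determinism properties can be probed locally along the following lines. This is a sketch under assumptions: the helpers are stand-ins (eps = 1e-5, tanh GELU, [Q | K | V] layout, True = may attend), and the 0.5 weight scale is chosen only so that the unmasked difference is clearly visible.

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return gamma * (x - mean) / np.sqrt(var + eps) + beta

def gelu(x):
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def mlp(x, W1, W2):
    return gelu(x @ W1) @ W2

def multi_head_attention(x, W_qkv, W_o, n_head, mask=None):
    B, T, C = x.shape
    hd = C // n_head
    q, k, v = (t.reshape(B, T, n_head, hd).transpose(0, 2, 1, 3)
               for t in np.split(x @ W_qkv, 3, axis=-1))
    s = q @ k.transpose(0, 1, 3, 2) / np.sqrt(hd)
    if mask is not None:
        s = np.where(mask, s, -1e9)  # assumes True = may attend
    w = np.exp(s - s.max(axis=-1, keepdims=True))
    w = w / w.sum(axis=-1, keepdims=True)
    return (w @ v).transpose(0, 2, 1, 3).reshape(B, T, C) @ W_o

def transformer_block(x, p, n_head, mask=None):
    h = x + multi_head_attention(layer_norm(x, p["gamma1"], p["beta1"]),
                                 p["W_qkv"], p["W_o"], n_head, mask)
    return h + mlp(layer_norm(h, p["gamma2"], p["beta2"]), p["W_mlp1"], p["W_mlp2"])

rng = np.random.default_rng(0)
B, T, C, n_head = 2, 6, 8, 2
params = {
    "gamma1": np.ones(C), "beta1": np.zeros(C),
    "gamma2": np.ones(C), "beta2": np.zeros(C),
    "W_qkv": 0.5 * rng.standard_normal((C, 3 * C)),
    "W_o": 0.5 * rng.standard_normal((C, C)),
    "W_mlp1": 0.5 * rng.standard_normal((C, 4 * C)),
    "W_mlp2": 0.5 * rng.standard_normal((4 * C, C)),
}
x = rng.standard_normal((B, T, C))
x_changed = x.copy()
x_changed[:, -1, :] = rng.standard_normal((B, C))  # alter only the last position

causal = np.tril(np.ones((T, T), dtype=bool))
a = transformer_block(x, params, n_head, causal)
b = transformer_block(x_changed, params, n_head, causal)
# With a causal mask, earlier positions cannot see the altered last token.
print(np.allclose(a[:, :-1], b[:, :-1], atol=1e-6))  # True

c = transformer_block(x, params, n_head)
d = transformer_block(x_changed, params, n_head)
# Without a mask, the change leaks into every position.
print(np.allclose(c[:, :-1], d[:, :-1], atol=1e-6))  # False

# Determinism: identical inputs give identical outputs.
print(np.array_equal(a, transformer_block(x, params, n_head, causal)))  # True
```

The last position is replaced with fresh random values rather than shifted by a constant, because LayerNorm removes a uniform shift across channels and the perturbation would then be invisible to the attention sub-layer.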