Causal mask: build + apply
Background
GPT-2 is a decoder: at training time it predicts token t+1 from the tokens up to and including position t, so position i must not attend to any position j > i. Otherwise the model "cheats" by reading the answer straight out of its input. The fix is a causal (lower-triangular) mask applied to the attention scores before softmax. It takes one triangular matrix and one np.where call, but the placement matters: masking before softmax is what makes future positions receive exactly zero probability.
Problem statement
Implement two functions:
- build_causal_mask(T): return a (T, T) boolean mask with mask[i, j] = True when j ≤ i (position i may attend to position j) and False when j > i. This is the lower triangle including the diagonal.
- apply_causal_mask(scores): given a (T, T) score matrix (the raw attention scores), return a copy with the upper triangle (j > i) set to -inf and the rest unchanged, so a subsequent softmax sends those entries to exactly 0.
Input
- build_causal_mask(T): T is an int, the sequence length.
- apply_causal_mask(scores): scores is an np.ndarray of shape (T, T), the raw attention scores.
Output
- build_causal_mask returns a (T, T) bool array (lower-triangular True).
- apply_causal_mask returns a (T, T) float array: original scores at/below the diagonal, -inf above it.
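The input/output contract above can be sketched in a few lines of NumPy. This is one possible reference shape under the stated assumptions (square 2-D scores, -np.inf as the fill value), not necessarily the solution the hidden tests were written against.

```python
import numpy as np

def build_causal_mask(T):
    # Lower triangle including the diagonal is True: mask[i, j] == (j <= i).
    return np.tril(np.ones((T, T), dtype=bool))

def apply_causal_mask(scores):
    # Assumes scores is a square (T, T) array.
    T = scores.shape[0]
    mask = build_causal_mask(T)
    # np.where allocates a new array, so the caller's scores are not modified.
    return np.where(mask, scores, -np.inf)

scores = np.full((3, 3), 2.0)
masked = apply_causal_mask(scores)
```

Because np.where builds a fresh array, the "return a copy" requirement is met without an explicit scores.copy().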
Examples
Example 1 — the mask for T = 4
build_causal_mask(4) =
[[ True, False, False, False],
[ True, True, False, False],
[ True, True, True, False],
[ True, True, True, True]]
Explanation: row i has True in columns 0..i (it can see itself and the past) and False afterward, so row i has exactly i + 1 True entries.
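As a quick sanity check on the explanation above, the sketch below builds the same mask two ways (np.tril, and a direct j <= i comparison via broadcasting) and counts the True entries per row. The variable names are illustrative only.

```python
import numpy as np

T = 4
mask = np.tril(np.ones((T, T), dtype=bool))

# Same mask from the definition: column index j <= row index i.
rows = np.arange(T)[:, None]   # shape (T, 1)
cols = np.arange(T)[None, :]   # shape (1, T)
mask_from_definition = cols <= rows

true_per_row = mask.sum(axis=1)
print(true_per_row)  # [1 2 3 4], i.e. i + 1 for row i
```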
Example 2 — applying it, then softmax
Input: scores = np.full((4, 4), 1.0)
apply_causal_mask(scores) =
[[ 1, -inf, -inf, -inf],
[ 1, 1, -inf, -inf],
[ 1, 1, 1, -inf],
[ 1, 1, 1, 1]]
row 0 after softmax = [1, 0, 0, 0] # future positions are exactly 0
Explanation: the upper triangle is replaced with -inf while the lower triangle keeps its scores; after a row-wise softmax, the -inf entries become exactly 0 and each row's surviving weights sum to 1 (row 0 attends only to position 0).
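Example 2 can be reproduced end to end with a small row-wise softmax. The helper softmax_rows below is a hypothetical name introduced for this sketch; it subtracts the row maximum for stability, which is safe here because the diagonal guarantees every row has at least one finite score.

```python
import numpy as np

def softmax_rows(x):
    # Subtract each row's max before exponentiating (numerical stability).
    shifted = x - x.max(axis=-1, keepdims=True)
    e = np.exp(shifted)          # exp(-inf) is exactly 0.0
    return e / e.sum(axis=-1, keepdims=True)

scores = np.full((4, 4), 1.0)
mask = np.tril(np.ones((4, 4), dtype=bool))
masked = np.where(mask, scores, -np.inf)
weights = softmax_rows(masked)

print(weights[0])  # [1. 0. 0. 0.]
print(weights[1])  # [0.5 0.5 0.  0. ]
```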
Constraints
- mask[i, j] = True iff j ≤ i; row i contains exactly i + 1 True values.
- Mask before softmax with -inf (or a large negative number) so masked probabilities are exactly 0, not merely close to 0.
- Do not zero weights after softmax: that leaves each row summing to less than 1 and corrupts the gradient.
- apply_causal_mask leaves the at/below-diagonal scores unchanged.
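The difference between masking before and after softmax is easy to see numerically. The sketch below uses random scores (an arbitrary seed) and a hypothetical softmax_rows helper; it only illustrates the row-sum constraint, not the gradient behaviour.

```python
import numpy as np

def softmax_rows(x):
    shifted = x - x.max(axis=-1, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
scores = rng.normal(size=(4, 4))
mask = np.tril(np.ones((4, 4), dtype=bool))

# Wrong: softmax over all positions, then zero out the future.
wrong = softmax_rows(scores) * mask
# Right: hide the future first, then normalise over what remains.
right = softmax_rows(np.where(mask, scores, -np.inf))

wrong_row_sums = wrong.sum(axis=1)   # rows 0..2 fall below 1
right_row_sums = right.sum(axis=1)   # every row sums to 1
```

Only the last row of the "wrong" variant still sums to 1, because it has no future positions to drop.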
Notes
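- -inf vs -1e9. Both work in float32/float64. -np.inf is conceptually cleaner (exp(-inf) = 0 exactly); a large finite negative is the usual alternative in low-precision code, but -1e9 itself exceeds the fp16 range (about ±65504), so pick a constant that fits the dtype there.
- np.tril(np.ones((T, T), bool)) builds the mask in one call.
- Series. apply_causal_mask plugs into build-gpt-06-multi-head-attention; this is the masking half of the attention stack alongside build-gpt-03.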
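To make the -inf versus -1e9 note concrete, the sketch below compares the two fill values in float64 and then checks what -1e9 becomes when cast to float16. The softmax_rows helper is a hypothetical name for this example, and the float16 behaviour shown is NumPy's; other frameworks may handle the cast differently.

```python
import numpy as np

def softmax_rows(x):
    shifted = x - x.max(axis=-1, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=-1, keepdims=True)

scores = np.full((4, 4), 1.0)
mask = np.tril(np.ones((4, 4), dtype=bool))

with_inf = softmax_rows(np.where(mask, scores, -np.inf))
with_big = softmax_rows(np.where(mask, scores, -1e9))
# In float64, exp(-1e9 - 1) underflows to 0.0, so both give identical weights.
same_weights = np.array_equal(with_inf, with_big)

# float16 cannot represent -1e9: its most negative finite value is -65504.
fp16_min = float(np.finfo(np.float16).min)
with np.errstate(over="ignore"):
    as_fp16 = np.float16(-1e9)   # overflows to -inf
```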
This problem ships 5 hidden tests. They run in your browser via Pyodide, with no backend and no submission queue.
- build_causal_mask: shape and dtype
- build_causal_mask: diagonal True; below diagonal True; above False
- apply_causal_mask: upper triangle becomes -inf
- After softmax, masked positions are exactly 0 (the diagnostic)
- Larger T works without quadratic blowup