Scaled dot-product self-attention
Background
Scaled dot-product attention is the single most important operation in modern deep learning — every attention head (self, cross, multi-head, grouped-query) bottoms out in this function. Each query compares itself against every key via a dot product, those scores become a probability distribution with softmax, and the output is the matching weighted average of the values. This problem is the single-head, self-attention version with an optional mask.
Problem statement
Implement attention(Q, K, V, mask=None):
Concretely: compute scores (where is the query/key dimension); if a mask is given, set the blocked score entries to a large negative number ( or ) before softmax; take a numerically-stable row-wise softmax (subtract each row's max); then multiply by .
Input
Q—np.ndarrayof shape(T, D): the queries.K—np.ndarrayof shape(T, D): the keys (sameDas queries).V—np.ndarrayof shape(T, D): the values.mask— optional(T, T)boolean array.True= attend to this position,False= block it.
Output
Returns an np.ndarray of shape (T, D) — one attention-weighted value vector per query.
Examples
Example 1 — uniform attention averages the values
Input: Q = [[0,0],[0,0]], K = [[0,0],[0,0]], V = [[1,0],[0,1]], mask=None
Output: [[0.5, 0.5],
[0.5, 0.5]]
Explanation: all scores are 0, so each row's softmax is and every output is the mean of the value rows: .
Example 2 — a causal mask blocks the future
Input: Q = [[0],[0]], K = [[0],[0]], V = [[10],[20]],
mask = [[True, False],
[True, True ]]
Output: [[10],
[15]]
Explanation: query 0 may only attend to key 0 (the future is masked to ), so its softmax is and the output is . Query 1 attends to both equally → , giving .
Constraints
- Scale by . Without it, dot-product magnitudes grow with ; at (GPT-2's per-head size) softmax saturates to one-hot and gradients vanish. This is the key thing the problem teaches.
- Mask the scores, not the weights. Set blocked entries to a large negative number before softmax (so they become after). Zeroing weights after softmax leaves the surviving weights summing to less than 1.
- Softmax is row-wise (over the key axis,
axis=-1) and must use the max-subtraction stability trick. - Mask convention:
True= attend,False= block. An all-Truemask is identical tomask=None. - Each output row is a convex combination of the value rows (every coordinate lies within that column's min/max of
V). Tests compare against a reference withatol≈1e-6.
Notes
- Why scaling matters mechanically. For independent unit-variance entries, over dimensions has variance ; dividing by restores unit scale so softmax stays in its sensitive (non-saturated) region.
- Series contract. The signature
attention(Q, K, V, mask=None)is exactly whatbuild-gpt-06-multi-head-attentioncalls per head later.
This problem ships 6 hidden tests. They run in your browser via Pyodide — no backend, no submission queue. Press ▶ Run tests to execute.
- •Output shape is (T, D)
- •Matches numpy reference (no mask)
- •Includes the sqrt(d_k) scaling — diagnostic at d=64
- •Honours a causal mask
- •Output rows sum the V rows under a probability distribution
- •Mask of all-True equals no mask