Scaled dot-product attention
Background
Scaled dot-product attention is the single most important operation in modern deep learning — every transformer head (self, cross, causal, multi-head) bottoms out in this function. Each query scores itself against every key via a dot product, those scores become a probability distribution with softmax, and the output is the matching weighted average of the values. This is the general form: the queries can have a different sequence length from the keys/values, and the value dimension is independent of the key dimension, so it covers both self- and cross-attention.
Problem statement
Implement attention(Q, K, V, mask=None):
Compute scores ; if a mask is given, set the blocked entries to a large negative number () before softmax; take a numerically-stable row-wise softmax (subtract each row's max); then multiply by .
Input
Q—np.ndarrayof shape[Tq, d_k]: the queries.K—np.ndarrayof shape[Tk, d_k]: the keys.V—np.ndarrayof shape[Tk, d_v]: the values.mask— optional[Tq, Tk]boolean array:True= attend,False= block.
Output
Returns an np.ndarray of shape [Tq, d_v].
Examples
Example 1 — uniform attention averages the values
Input: Q = [[0, 0]], K = [[0, 0], [0, 0]], V = [[1, 0], [0, 1]], mask = None
Output: [[0.5, 0.5]]
Explanation: all scores are 0, so the softmax over the two keys is and the single output row is the mean of the value rows. Here , so the result is [1, 2]-shaped.
Example 2 — a causal mask blocks the future
Input: Q = [[0],[0]], K = [[0],[0]], V = [[10],[20]],
mask = [[True, False],
[True, True ]]
Output: [[10], [15]]
Explanation: query 0 may attend only to key 0 (the future is masked to ), so its softmax is and the output is . Query 1 attends to both equally → .
Constraints
- Scale by . Without it, dot products grow with the dimension and softmax saturates to one-hot, killing gradients.
- Mask the scores, not the weights: set blocked entries to a large negative number before softmax (so they become after). Zeroing weights after softmax leaves the row summing to less than 1.
- Softmax is row-wise (over the key axis,
axis=-1) and must use the max-subtraction stability trick. - Mask convention:
True= attend,False= block; an all-Truemask equalsmask=None. Output shape is[Tq, d_v].
Notes
- Self vs cross attention. With this is self-attention; allowing (and ) makes the same function serve cross-attention.
- Related. build-gpt-03 is the batched, square-
Tvariant used inside multi-head attention; the scaling is the key idea both share.
This problem ships 5 hidden tests. They run in your browser via Pyodide — no backend, no submission queue. Press ▶ Run tests to execute.
- •Output shape is [Tq, d_v]
- •Matches numpy reference (no mask)
- •Applies sqrt(d_k) scaling
- •Honors a causal mask
- •Mask of all-True equals no mask