Scaled dot-product self-attentionMedium

Scaled dot-product self-attention

Background

Scaled dot-product attention is the single most important operation in modern deep learning — every attention head (self, cross, multi-head, grouped-query) bottoms out in this function. Each query compares itself against every key via a dot product, those scores become a probability distribution with softmax, and the output is the matching weighted average of the values. This problem is the single-head, self-attention version with an optional mask.

Problem statement

Implement attention(Q, K, V, mask=None):

Attention(Q,K,V)=softmax ⁣(QKdk)V\text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{Q K^\top}{\sqrt{d_k}}\right) V

Concretely: compute scores S=QKdkS = \dfrac{Q K^\top}{\sqrt{d_k}} (where dkd_k is the query/key dimension); if a mask is given, set the blocked score entries to a large negative number (109-10^9 or -\infty) before softmax; take a numerically-stable row-wise softmax (subtract each row's max); then multiply by VV.

Input

  • Qnp.ndarray of shape (T, D): the queries.
  • Knp.ndarray of shape (T, D): the keys (same D as queries).
  • Vnp.ndarray of shape (T, D): the values.
  • mask — optional (T, T) boolean array. True = attend to this position, False = block it.

Output

Returns an np.ndarray of shape (T, D) — one attention-weighted value vector per query.

Examples

Example 1 — uniform attention averages the values

Input:  Q = [[0,0],[0,0]], K = [[0,0],[0,0]], V = [[1,0],[0,1]], mask=None
Output: [[0.5, 0.5],
         [0.5, 0.5]]

Explanation: all scores are 0, so each row's softmax is [0.5,0.5][0.5, 0.5] and every output is the mean of the value rows: 0.5[1,0]+0.5[0,1]=[0.5,0.5]0.5\cdot[1,0] + 0.5\cdot[0,1] = [0.5, 0.5].

Example 2 — a causal mask blocks the future

Input:  Q = [[0],[0]], K = [[0],[0]], V = [[10],[20]],
        mask = [[True, False],
                [True, True ]]
Output: [[10],
         [15]]

Explanation: query 0 may only attend to key 0 (the future is masked to 109-10^9), so its softmax is [1,0][1, 0] and the output is V0=10V_0 = 10. Query 1 attends to both equally → [0.5,0.5][0.5, 0.5], giving 0.510+0.520=150.5\cdot10 + 0.5\cdot20 = 15.

Constraints

  • Scale by dk\sqrt{d_k}. Without it, dot-product magnitudes grow with DD; at dk=64d_k = 64 (GPT-2's per-head size) softmax saturates to one-hot and gradients vanish. This is the key thing the problem teaches.
  • Mask the scores, not the weights. Set blocked entries to a large negative number before softmax (so they become 0\approx 0 after). Zeroing weights after softmax leaves the surviving weights summing to less than 1.
  • Softmax is row-wise (over the key axis, axis=-1) and must use the max-subtraction stability trick.
  • Mask convention: True = attend, False = block. An all-True mask is identical to mask=None.
  • Each output row is a convex combination of the value rows (every coordinate lies within that column's min/max of V). Tests compare against a reference with atol≈1e-6.

Notes

  • Why scaling matters mechanically. For independent unit-variance entries, QKQ\cdot K over dkd_k dimensions has variance dkd_k; dividing by dk\sqrt{d_k} restores unit scale so softmax stays in its sensitive (non-saturated) region.
  • Series contract. The signature attention(Q, K, V, mask=None) is exactly what build-gpt-06-multi-head-attention calls per head later.
Python
Loading...

This problem ships 6 hidden tests. They run in your browser via Pyodide — no backend, no submission queue. Press ▶ Run tests to execute.

  • Output shape is (T, D)
  • Matches numpy reference (no mask)
  • Includes the sqrt(d_k) scaling — diagnostic at d=64
  • Honours a causal mask
  • Output rows sum the V rows under a probability distribution
  • Mask of all-True equals no mask