Scaled dot-product attentionMedium

Scaled dot-product attention

Background

Scaled dot-product attention is the single most important operation in modern deep learning — every transformer head (self, cross, causal, multi-head) bottoms out in this function. Each query scores itself against every key via a dot product, those scores become a probability distribution with softmax, and the output is the matching weighted average of the values. This is the general form: the queries can have a different sequence length from the keys/values, and the value dimension is independent of the key dimension, so it covers both self- and cross-attention.

Problem statement

Implement attention(Q, K, V, mask=None):

Attention(Q,K,V)=softmax ⁣(QKdk)V\text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{Q K^\top}{\sqrt{d_k}}\right) V

Compute scores QK/dkQK^\top / \sqrt{d_k}; if a mask is given, set the blocked entries to a large negative number (109-10^9) before softmax; take a numerically-stable row-wise softmax (subtract each row's max); then multiply by VV.

Input

  • Qnp.ndarray of shape [Tq, d_k]: the queries.
  • Knp.ndarray of shape [Tk, d_k]: the keys.
  • Vnp.ndarray of shape [Tk, d_v]: the values.
  • mask — optional [Tq, Tk] boolean array: True = attend, False = block.

Output

Returns an np.ndarray of shape [Tq, d_v].

Examples

Example 1 — uniform attention averages the values

Input:  Q = [[0, 0]], K = [[0, 0], [0, 0]], V = [[1, 0], [0, 1]], mask = None
Output: [[0.5, 0.5]]

Explanation: all scores are 0, so the softmax over the two keys is [0.5,0.5][0.5, 0.5] and the single output row is the mean of the value rows. Here Tq=1,Tk=2,dv=2T_q=1, T_k=2, d_v=2, so the result is [1, 2]-shaped.

Example 2 — a causal mask blocks the future

Input:  Q = [[0],[0]], K = [[0],[0]], V = [[10],[20]],
        mask = [[True, False],
                [True, True ]]
Output: [[10], [15]]

Explanation: query 0 may attend only to key 0 (the future is masked to 109-10^9), so its softmax is [1,0][1, 0] and the output is V0=10V_0 = 10. Query 1 attends to both equally → 0.510+0.520=150.5\cdot10 + 0.5\cdot20 = 15.

Constraints

  • Scale by dk\sqrt{d_k}. Without it, dot products grow with the dimension and softmax saturates to one-hot, killing gradients.
  • Mask the scores, not the weights: set blocked entries to a large negative number before softmax (so they become 0\approx 0 after). Zeroing weights after softmax leaves the row summing to less than 1.
  • Softmax is row-wise (over the key axis, axis=-1) and must use the max-subtraction stability trick.
  • Mask convention: True = attend, False = block; an all-True mask equals mask=None. Output shape is [Tq, d_v].

Notes

  • Self vs cross attention. With Tq=TkT_q = T_k this is self-attention; allowing TqTkT_q \ne T_k (and dvdkd_v \ne d_k) makes the same function serve cross-attention.
  • Related. build-gpt-03 is the batched, square-T variant used inside multi-head attention; the dk\sqrt{d_k} scaling is the key idea both share.
Python
Loading...

This problem ships 5 hidden tests. They run in your browser via Pyodide — no backend, no submission queue. Press ▶ Run tests to execute.

  • Output shape is [Tq, d_v]
  • Matches numpy reference (no mask)
  • Applies sqrt(d_k) scaling
  • Honors a causal mask
  • Mask of all-True equals no mask