Sinusoidal positional encoding
Background
This is the positional encoding from the original Transformer paper ("Attention Is All You Need") — fixed and parameter-free, and it extrapolates to sequences longer than anything seen in training. A transformer's attention is permutation-invariant on its own, so positions must be injected somehow; this encoding does it with sines and cosines of geometrically-spaced frequencies. GPT-2 ultimately used learned positional embeddings instead, but the sinusoidal version is the one interviews ask about because the math is non-trivial.
Problem statement
Implement sinusoidal_pe(T, C): build the (T, C) positional-encoding table where, for position and dimension-pair index :
Even columns get the sine, odd columns the cosine; each consecutive pair shares the same frequency divisor .
Input
T—int: the sequence length (number of positions).C—int: the embedding dimension. Even, so each pair is one sin/cos couple.
Output
Returns an np.ndarray of shape (T, C), dtype float64.
Examples
Example 1 — the baseline
Input: T=4, C=8
Output: row 0 = [0, 1, 0, 1, 0, 1, 0, 1]
Explanation: at every angle is 0, so all even (sine) entries are and all odd (cosine) entries are — a clean, dimension-independent baseline.
Example 2 — a hand-checked row (, inspect )
Input: T=10, C=4
Output: PE[3] = [sin(3), cos(3), sin(3/100), cos(3/100)]
≈ [0.1411, -0.9900, 0.0300, 0.9996]
Explanation: for the first pair () the divisor is , giving ; for the second pair () the divisor is , giving . The frequencies span orders of magnitude — low dims change fast with position, high dims change slowly.
Constraints
Cis even; even columns are filled with , odd columns with .- The frequency divisor for pair is (equivalently for the even index ).
- Every entry lies in (it is a sine or cosine).
- Distinct positions must produce distinct rows — no collisions, or attention cannot tell positions apart.
- Must stay finite (no
NaN/Inf) and valid for largeT(e.g. 2048) — that is the extrapolation property.
Notes
- Why this design. The encoding for position is a fixed linear function of the encoding for , so attention can implement a "shift by " as a constant linear map — a built-in prior on relative position. (Not tested directly here, but it is the reason this scheme was chosen.)
- Series. Step 2 of the build-gpt track, following token embeddings; the sum of token + positional encodings is what enters the first transformer block.
This problem ships 6 hidden tests. They run in your browser via Pyodide — no backend, no submission queue. Press ▶ Run tests to execute.
- •Output shape is (T, C)
- •At position t=0, even dims are 0 (sin) and odd dims are 1 (cos)
- •Matches the explicit formula at a hand-checked entry
- •All values lie in [-1, 1]
- •Different positions produce different encodings (no collisions)
- •Larger sequence length still produces a valid table (extrapolation)