Sinusoidal positional encodingEasy

Sinusoidal positional encoding

Background

This is the positional encoding from the original Transformer paper ("Attention Is All You Need") — fixed and parameter-free, and it extrapolates to sequences longer than anything seen in training. A transformer's attention is permutation-invariant on its own, so positions must be injected somehow; this encoding does it with sines and cosines of geometrically-spaced frequencies. GPT-2 ultimately used learned positional embeddings instead, but the sinusoidal version is the one interviews ask about because the math is non-trivial.

Problem statement

Implement sinusoidal_pe(T, C): build the (T, C) positional-encoding table where, for position t[0,T)t \in [0, T) and dimension-pair index i[0,C/2)i \in [0, C/2):

PE[t,2i]=sin ⁣(t100002i/C),PE[t,2i+1]=cos ⁣(t100002i/C)PE[t,\, 2i] = \sin\!\left(\frac{t}{10000^{\,2i/C}}\right), \qquad PE[t,\, 2i+1] = \cos\!\left(\frac{t}{10000^{\,2i/C}}\right)

Even columns get the sine, odd columns the cosine; each consecutive pair (2i,2i+1)(2i, 2i+1) shares the same frequency divisor 100002i/C10000^{\,2i/C}.

Input

  • Tint: the sequence length (number of positions).
  • Cint: the embedding dimension. Even, so each (2i,2i+1)(2i, 2i+1) pair is one sin/cos couple.

Output

Returns an np.ndarray of shape (T, C), dtype float64.

Examples

Example 1 — the t=0t=0 baseline

Input:  T=4, C=8
Output: row 0 = [0, 1, 0, 1, 0, 1, 0, 1]

Explanation: at t=0t=0 every angle is 0, so all even (sine) entries are sin0=0\sin 0 = 0 and all odd (cosine) entries are cos0=1\cos 0 = 1 — a clean, dimension-independent baseline.

Example 2 — a hand-checked row (T=10,C=4T=10, C=4, inspect t=3t=3)

Input:  T=10, C=4
Output: PE[3] = [sin(3), cos(3), sin(3/100), cos(3/100)]
              ≈ [0.1411, -0.9900, 0.0300, 0.9996]

Explanation: for the first pair (i=0i=0) the divisor is 100000=110000^{0}=1, giving sin3, cos3\sin 3,\ \cos 3; for the second pair (i=1i=1) the divisor is 100002/4=10010000^{2/4}=100, giving sin(3/100), cos(3/100)\sin(3/100),\ \cos(3/100). The frequencies span orders of magnitude — low dims change fast with position, high dims change slowly.

Constraints

  • C is even; even columns are filled with sin\sin, odd columns with cos\cos.
  • The frequency divisor for pair ii is 100002i/C10000^{\,2i/C} (equivalently 10000i/C10000^{\,i'/C} for the even index i=2ii' = 2i).
  • Every entry lies in [1,1][-1, 1] (it is a sine or cosine).
  • Distinct positions must produce distinct rows — no collisions, or attention cannot tell positions apart.
  • Must stay finite (no NaN/Inf) and valid for large T (e.g. 2048) — that is the extrapolation property.

Notes

  • Why this design. The encoding for position t+kt+k is a fixed linear function of the encoding for tt, so attention can implement a "shift by kk" as a constant linear map — a built-in prior on relative position. (Not tested directly here, but it is the reason this scheme was chosen.)
  • Series. Step 2 of the build-gpt track, following token embeddings; the sum of token + positional encodings is what enters the first transformer block.
Python
Loading...

This problem ships 6 hidden tests. They run in your browser via Pyodide — no backend, no submission queue. Press ▶ Run tests to execute.

  • Output shape is (T, C)
  • At position t=0, even dims are 0 (sin) and odd dims are 1 (cos)
  • Matches the explicit formula at a hand-checked entry
  • All values lie in [-1, 1]
  • Different positions produce different encodings (no collisions)
  • Larger sequence length still produces a valid table (extrapolation)