Rotary Position Embedding (RoPE)
Background
Rotary Position Embedding (RoPE, Su et al. 2021) is the positional scheme that displaced sinusoidal and learned embeddings in modern LLMs: Llama, Mistral, Qwen, and Gemma all use it. Instead of adding a positional signal to the token embeddings, RoPE is applied directly to the query and key vectors before attention: it splits each feature vector into 2D pairs and rotates each pair by an angle that grows with position. A rotation preserves length, and the dot product of two rotated pairs depends only on the difference between their rotation angles, so the attention dot product ends up encoding both content similarity and relative position (through the angle difference).
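The relative-position property can be checked numerically on a single 2D pair. The sketch below is illustrative only: the helper `rotate2d`, the frequency `theta = 0.3`, and the vectors `q` and `k` are arbitrary choices made for this example, not part of the problem.

```python
import numpy as np

def rotate2d(v, angle):
    """Rotate a 2D vector counter-clockwise by `angle` radians."""
    c, s = np.cos(angle), np.sin(angle)
    return np.array([c * v[0] - s * v[1], s * v[0] + c * v[1]])

theta = 0.3                      # arbitrary per-pair frequency (assumption)
q = np.array([0.7, -1.2])        # arbitrary query pair
k = np.array([0.4, 2.0])         # arbitrary key pair

# Same offset (n - m = 3) at two different absolute positions.
score_a = rotate2d(q, 2 * theta) @ rotate2d(k, 5 * theta)
score_b = rotate2d(q, 10 * theta) @ rotate2d(k, 13 * theta)

# A different offset (n - m = 4) gives a different score.
score_c = rotate2d(q, 2 * theta) @ rotate2d(k, 6 * theta)

print(np.isclose(score_a, score_b), np.isclose(score_a, score_c))
```

Shifting both positions by the same amount leaves the score unchanged, while changing the gap between them does not.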
Problem statement
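Implement apply_rotary(x) for a (B, T, C) tensor (C even). Split the last axis into C/2 pairs $(x_{2i}, x_{2i+1})$; rotate pair $i$ at position $t$ by angle $t\,\theta_i$, with $\theta_i = 10000^{-2i/C}$:
$$\begin{pmatrix} x'_{2i} \\ x'_{2i+1} \end{pmatrix} = \begin{pmatrix} \cos(t\theta_i) & -\sin(t\theta_i) \\ \sin(t\theta_i) & \cos(t\theta_i) \end{pmatrix} \begin{pmatrix} x_{2i} \\ x_{2i+1} \end{pmatrix}$$
i.e. $x'_{2i} = x_{2i}\cos(t\theta_i) - x_{2i+1}\sin(t\theta_i)$ and $x'_{2i+1} = x_{2i}\sin(t\theta_i) + x_{2i+1}\cos(t\theta_i)$.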
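A minimal sketch of the angle table, assuming per-pair frequencies $\theta_i = 10000^{-2i/C}$ (this matches the frequencies 1 and 0.01 quoted for C = 4 in Example 2). The function name `rope_angles` and the `base` parameter are hypothetical, introduced only for illustration.

```python
import numpy as np

def rope_angles(T, C, base=10000.0):
    """Return the (T, C//2) table with angles[t, i] = t * theta_i."""
    i = np.arange(C // 2)               # pair index 0 .. C/2 - 1
    theta = base ** (-2.0 * i / C)      # assumed per-pair frequency
    t = np.arange(T)                    # positions 0 .. T - 1
    return np.outer(t, theta)

angles = rope_angles(T=3, C=4)
print(angles.shape)   # (3, 2)
print(angles[1])      # angles at position 1
```

Row 0 of the table is all zeros, which is why position 0 is left unchanged by the rotation.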
Input
x: np.ndarray of shape (B, T, C); C must be even.
Output
Returns an np.ndarray of shape (B, T, C) — RoPE applied.
Examples
Example 1 — position 0 is the identity
Input: any x; inspect position t=0
Output: out[:, 0, :] == x[:, 0, :] (unchanged)
Explanation: at $t = 0$ every angle $t\,\theta_i = 0$, so $\cos = 1$, $\sin = 0$, and each pair is rotated by nothing: the first token passes through untouched.
Example 2 — hand-checked rotation ($C = 4$, position $t = 1$)
Input: x[0, 1] = [1.0, 0.0, 1.0, 0.0]
Output: [cos(1), sin(1), cos(0.01), sin(0.01)] ≈ [0.5403, 0.8415, 0.99995, 0.01]
Explanation: with $C = 4$ the two pair frequencies are $\theta_0 = 10000^{0} = 1$ and $\theta_1 = 10000^{-2/4} = 0.01$. At $t = 1$: pair 0 [1,0] rotates by 1 rad → [cos 1, sin 1]; pair 1 [1,0] rotates by 0.01 rad → [cos 0.01, sin 0.01].
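One way to double-check the hand computation in Example 2 is to treat each pair $(a, b)$ as the complex number $a + ib$; multiplying by $e^{i\varphi}$ rotates it by $\varphi$. This is an alternative view used here only as a cross-check, not necessarily how the reference solution is written.

```python
import numpy as np

x = np.array([1.0, 0.0, 1.0, 0.0])     # x[0, 1] from Example 2
angles = np.array([1.0, 0.01])         # t * theta_i at t = 1 for C = 4

# View each (even, odd) pair as one complex number a + ib.
z = x[0::2] + 1j * x[1::2]
z_rot = z * np.exp(1j * angles)        # rotate each pair by its angle

# Interleave real and imaginary parts back into a length-4 vector.
out = np.stack([z_rot.real, z_rot.imag], axis=-1).reshape(-1)
print(out)
```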
Constraints
- C is even; pair the last axis as $(x_0, x_1), (x_2, x_3), \dots$ and rotate each pair independently (no cross-pair mixing); x.reshape(B, T, C//2, 2) is the clean way.
- Use the rotation angle $t\,\theta_i$ for pair $i$ at position $t$; broadcast cos/sin of the (T, C/2) angle table over the batch.
- Position 0 is the identity; the operation is norm-preserving: $\lVert \text{out}[b, t] \rVert = \lVert x[b, t] \rVert$ per position (atol≈1e-6).
- Must run cleanly (no NaN/Inf) at realistic sizes (e.g. C=32, T=16).
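The constraints above suggest a reshape-and-broadcast implementation. The following is a minimal sketch, not the official solution. It assumes frequencies $\theta_i = 10000^{-2i/C}$ and positions 0 to T-1. The optional `base` argument is an addition for illustration; the problem itself only asks for apply_rotary(x).

```python
import numpy as np

def apply_rotary(x, base=10000.0):
    """Apply RoPE to x of shape (B, T, C) with C even (sketch)."""
    B, T, C = x.shape
    if C % 2 != 0:
        raise ValueError("C must be even")

    theta = base ** (-2.0 * np.arange(C // 2) / C)   # (C/2,)
    angles = np.outer(np.arange(T), theta)           # (T, C/2)
    cos, sin = np.cos(angles), np.sin(angles)        # broadcast over batch

    pairs = x.reshape(B, T, C // 2, 2)
    x_even, x_odd = pairs[..., 0], pairs[..., 1]     # each (B, T, C/2)

    # 2D rotation applied to every pair independently.
    out_even = x_even * cos - x_odd * sin
    out_odd = x_even * sin + x_odd * cos
    return np.stack([out_even, out_odd], axis=-1).reshape(B, T, C)

rng = np.random.default_rng(0)
x = rng.standard_normal((2, 16, 32))
out = apply_rotary(x)
print(out.shape)   # (2, 16, 32)
```

Stacking the two halves on a new last axis and reshaping restores the original interleaved layout, so pair $i$ ends up back in columns $2i$ and $2i+1$.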
Notes
- Sign matters. Get the cos/sin signs wrong and the "rotation" becomes a reflection — norms are still preserved, but the position-0-identity check fails, which is exactly what the diagnostic catches.
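- Frequency spread. Low-index pairs rotate fast ($\theta_0 = 1$), high-index pairs slowly ($\theta_i$ approaches $10^{-4}$ for the last pairs), so attention can read short- or long-range patterns from different frequency bands. Contrast with additive sinusoidal PE, which adds a positional signal to the token embeddings instead of rotating queries and keys.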
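The sign pitfall from the first note can be illustrated with two 2×2 matrices. The `sign_bug` matrix below is one hypothetical mistake (the minus sign placed on the wrong entry), chosen for illustration; other sign errors behave differently.

```python
import numpy as np

def rotation(phi):
    """Correct 2D rotation matrix."""
    c, s = np.cos(phi), np.sin(phi)
    return np.array([[c, -s],
                     [s,  c]])

def sign_bug(phi):
    """Hypothetical mistake: this matrix is a reflection (det = -1)."""
    c, s = np.cos(phi), np.sin(phi)
    return np.array([[c,  s],
                     [s, -c]])

v = np.array([0.6, -0.8])          # unit-length test vector

# Both matrices preserve length at any angle ...
norm_rot = np.linalg.norm(rotation(0.7) @ v)
norm_bug = np.linalg.norm(sign_bug(0.7) @ v)

# ... but only the true rotation is the identity at angle 0.
at_zero_rot = rotation(0.0) @ v    # equals v
at_zero_bug = sign_bug(0.0) @ v    # second component is flipped

print(norm_rot, norm_bug, at_zero_rot, at_zero_bug)
```

A different sign error, one that yields a rotation in the opposite direction, would still pass the position-0 check; the hand-checked rotation at position 1 is what distinguishes that case.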
This problem ships 6 hidden tests. They run in your browser via Pyodide, with no backend and no submission queue.
- Output shape matches input
- Diagnostic: position t=0 is the identity (theta=0 -> rotation by 0)
- Norm-preserving: ||rotated|| == ||original|| (rotation preserves L2 norm)
- Hand-checked rotation at position 1, dim pair 0
- Pairs of dimensions are rotated independently (no cross-pair mixing)
- Larger C and T runs without error