Rotary Position Embedding (RoPE)

Difficulty: Hard

Background

Rotary Position Embedding (RoPE, Su et al. 2021) is the positional scheme that displaced sinusoidal and learned embeddings in modern LLMs; Llama, Mistral, Qwen, and Gemma all use it. Instead of adding a positional signal to the token embeddings, RoPE is applied directly to the query and key vectors before attention: it splits each feature vector into 2D pairs and rotates each pair by an angle that grows with position. Because rotations preserve length and compose by adding angles, the attention dot product $q \cdot k$ encodes both content similarity and relative position (through the angle difference), independent of the absolute positions.
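The relative-position claim can be checked numerically on a single 2D pair. The snippet below is only an illustrative sketch: `rotate_pair` and the frequency `freq = 0.3` are hypothetical names and values chosen for the demo, not part of the problem. It rotates a query and a key at two different absolute positions with the same offset and compares the resulting dot products.

```python
import numpy as np

def rotate_pair(v, angle):
    """Rotate a 2-vector counter-clockwise by `angle` radians."""
    c, s = np.cos(angle), np.sin(angle)
    return np.array([c * v[0] - s * v[1], s * v[0] + c * v[1]])

rng = np.random.default_rng(0)
q, k = rng.normal(size=2), rng.normal(size=2)
freq = 0.3  # hypothetical per-pair frequency, for illustration only

# Same offset (key position minus query position = 3) at two absolute positions.
score_a = rotate_pair(q, 2 * freq) @ rotate_pair(k, 5 * freq)
score_b = rotate_pair(q, 10 * freq) @ rotate_pair(k, 13 * freq)

# The scores should agree up to floating-point error.
print(np.isclose(score_a, score_b))
```

Only the difference of the two angles enters the dot product, which is why shifting both positions by the same amount leaves the score unchanged.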

Problem statement

Implement apply_rotary(x) for a (B, T, C) tensor (C even). Split the last axis into C/2 pairs; rotate the pair $(x_{2i}, x_{2i+1})$ at position $t$ by angle $\theta = t / 10000^{2i/C}$:

$$
\theta = \frac{t}{10000^{\,2i/C}}, \qquad
\begin{bmatrix} y_{2i} \\ y_{2i+1} \end{bmatrix} =
\begin{bmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{bmatrix}
\begin{bmatrix} x_{2i} \\ x_{2i+1} \end{bmatrix}
$$

i.e. $y_{2i} = x_{2i}\cos\theta - x_{2i+1}\sin\theta$ and $y_{2i+1} = x_{2i}\sin\theta + x_{2i+1}\cos\theta$.
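The angle formula can be tabulated for all positions and pairs at once. The following is a minimal sketch, assuming positions are numbered from 0; the helper name `rope_angles` and its `base` parameter are hypothetical and introduced here only for illustration.

```python
import numpy as np

def rope_angles(T, C, base=10000.0):
    """Return the (T, C//2) table theta[t, i] = t / base**(2i/C)."""
    assert C % 2 == 0, "C must be even"
    t = np.arange(T, dtype=np.float64)[:, None]       # positions, shape (T, 1)
    i = np.arange(C // 2, dtype=np.float64)[None, :]  # pair indices, shape (1, C/2)
    return t / base ** (2.0 * i / C)                  # broadcasts to (T, C/2)

theta = rope_angles(T=3, C=4)
print(theta.shape)
print(theta)
```

For C=4 the two columns should be t and t/100, matching the per-pair angles used in the hand-checked example.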

Input

  • x: np.ndarray of shape (B, T, C); C must be even.

Output

Returns an np.ndarray of shape (B, T, C): x with RoPE applied.

Examples

Example 1 — position 0 is the identity

Input:  any x; inspect position t=0
Output: out[:, 0, :] == x[:, 0, :]   (unchanged)

Explanation: at $t=0$ every angle $\theta = 0$, so $\cos\theta = 1$ and $\sin\theta = 0$; each pair is rotated by nothing and the first token passes through untouched.

Example 2 — hand-checked rotation ($C=4$, position $t=1$)

Input:  x[0, 1] = [1.0, 0.0, 1.0, 0.0]
Output: [cos(1), sin(1), cos(0.01), sin(0.01)] ≈ [0.5403, 0.8415, 0.99995, 0.01]

Explanation: with $C=4$ the two per-pair angles are $\theta_0 = t/10000^{0} = t$ and $\theta_1 = t/10000^{2/4} = t/100$. At $t=1$: pair 0 [1,0] rotates by 1 rad → [cos 1, sin 1]; pair 1 [1,0] rotates by 0.01 rad → [cos 0.01, sin 0.01].
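The hand computation can be reproduced by applying the 2×2 rotation matrix to each pair in turn. This is a deliberately slow, loop-based sketch meant only for checking the arithmetic; `rotation_matrix` is a hypothetical helper name.

```python
import numpy as np

def rotation_matrix(theta):
    """2x2 counter-clockwise rotation by theta radians."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s], [s, c]])

x = np.array([1.0, 0.0, 1.0, 0.0])  # the C=4 feature vector at position t=1
t, C = 1, 4

out = np.empty_like(x)
for i in range(C // 2):
    theta = t / 10000 ** (2 * i / C)               # 1 rad for pair 0, 0.01 rad for pair 1
    out[2 * i:2 * i + 2] = rotation_matrix(theta) @ x[2 * i:2 * i + 2]

print(out)
```

The printed values should agree with [cos(1), sin(1), cos(0.01), sin(0.01)] from the example.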

Constraints

  • C is even; pair the last axis as $(2i, 2i+1)$ and rotate each pair independently (no cross-pair mixing). x.reshape(B, T, C//2, 2) is the clean way.
  • Use the rotation angle $\theta = t / 10000^{2i/C}$ for pair $i$ at position $t$; broadcast cos/sin of the (T, C/2) angle table over the batch.
  • Position 0 is the identity; the operation is norm-preserving: $\lVert\text{out}\rVert = \lVert x\rVert$ per position (atol≈1e-6).
  • Must run cleanly (no NaN/Inf) at realistic sizes (e.g. C=32, T=16).
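Putting the constraints together, one possible NumPy solution looks like the sketch below. It is not the reference solution for this problem: the `base` keyword is an added assumption (the statement fixes it at 10000), and the result is computed in float64 regardless of the input dtype.

```python
import numpy as np

def apply_rotary(x, base=10000.0):
    """Apply RoPE to x of shape (B, T, C), pairing features as (2i, 2i+1)."""
    B, T, C = x.shape
    if C % 2 != 0:
        raise ValueError("C must be even")

    # Angle table theta[t, i] = t / base**(2i/C), shape (T, C/2).
    t = np.arange(T, dtype=np.float64)[:, None]
    i = np.arange(C // 2, dtype=np.float64)[None, :]
    theta = t / base ** (2.0 * i / C)
    cos, sin = np.cos(theta), np.sin(theta)

    # View the last axis as C/2 pairs; each pair is rotated independently.
    pairs = x.reshape(B, T, C // 2, 2)
    x_even, x_odd = pairs[..., 0], pairs[..., 1]

    # (T, C/2) cos/sin broadcast over the leading batch axis.
    y_even = x_even * cos - x_odd * sin
    y_odd = x_even * sin + x_odd * cos

    # Re-interleave so that outputs 2i and 2i+1 sit next to each other again.
    return np.stack([y_even, y_odd], axis=-1).reshape(B, T, C)

rng = np.random.default_rng(0)
x = rng.normal(size=(2, 16, 32))
out = apply_rotary(x)
print(out.shape)
print(np.allclose(out[:, 0, :], x[:, 0, :]))  # position 0 is the identity
print(np.allclose(np.linalg.norm(out, axis=-1), np.linalg.norm(x, axis=-1)))
```

Stacking on a new last axis and reshaping restores the interleaved (2i, 2i+1) layout; concatenating the even and odd halves instead would give a different output ordering from the one the statement describes.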

Notes

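  • Sign matters. Negate one of the cos terms and the "rotation" becomes a reflection: norms are still preserved, but the position-0-identity check fails, which is exactly what the diagnostic catches. Swap only the sin signs and you get a rotation in the opposite direction, which passes the identity check but fails the hand-checked rotation.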
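  • Frequency spread. Low-index pairs rotate fast ($\theta = t$ for $i = 0$), high-index pairs slowly ($\theta = t/10000^{1-2/C}$ for the last pair, approaching $t/10000$ for large C), so attention can read short- or long-range patterns from different frequency bands. Contrast with additive sinusoidal PE, which adds its signal to the token embeddings instead of rotating queries and keys.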
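The sign note can be made concrete by comparing the correct matrix with a deliberately broken one. In this sketch `sign_bug` is a hypothetical buggy variant (the lower-right cos is negated), chosen only to illustrate a reflection; it is not taken from any particular implementation.

```python
import numpy as np

def rotation(theta):
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s], [s, c]])

def sign_bug(theta):
    # Hypothetical bug: signs misplaced so the matrix is a reflection.
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, s], [s, -c]])

v = np.array([3.0, 4.0])  # norm 5
for name, make in [("rotation", rotation), ("sign_bug", sign_bug)]:
    m0, m1 = make(0.0), make(1.0)
    det = float(np.linalg.det(m1))                          # +1 rotation, -1 reflection
    keeps_norm = bool(np.isclose(np.linalg.norm(m1 @ v), 5.0))
    identity_at_0 = bool(np.allclose(m0, np.eye(2)))
    print(name, det, keeps_norm, identity_at_0)
```

Both matrices are orthogonal, so a norm test alone cannot tell them apart; the identity check at position 0 is what separates them.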

This problem ships 6 hidden tests. They run in your browser via Pyodide, with no backend and no submission queue:

  • Output shape matches input
  • Diagnostic: position t=0 is the identity (theta=0 -> rotation by 0)
  • Norm-preserving: ||rotated|| == ||original|| (rotation preserves L2 norm)
  • Hand-checked rotation at position 1, dim pair 0
  • Pairs of dimensions are rotated independently (no cross-pair mixing)
  • Larger C and T runs without error