transformer
20 problems
- Token embedding lookup [Easy]
- Sinusoidal positional encoding [Easy]
- Scaled dot-product self-attention [Medium]
- Causal mask: build + apply [Easy]
- Multi-head split + combine [Medium]
- Multi-Head Attention (full layer) [Medium]
- LayerNorm forward [Easy]
- Transformer block forward (pre-LN, residual) [Hard]
- Dynamic-Tanh (DyT) [Medium]
- Efficient sparse window attention [Medium]
- FlashAttention tiled forward [Hard]
- GELU forward (tanh approximation) [Easy]
- Grouped-query attention [Medium]
- KV cache for autoregressive inference [Medium]
- KV cache compression (MLA) [Hard]
- Noisy top-k gating [Medium]
- RMSNorm [Easy]
- Rotary Position Embedding (RoPE) [Hard]
- Sparse mixture-of-experts layer [Hard]
- SwiGLU activation [Medium]