Warmup + cosine decay LR schedule
Background
This is the standard learning-rate schedule for modern transformer training — a linear warmup over the first few hundred steps, then a smooth cosine decay down to a floor, then a flat tail. GPT-2, GPT-3, Llama, and most public LLM runs use it. It is a pure function of the step: no state, no side effects, just the LR to use right now.
Problem statement
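Implement lr_at(step, lr_max, lr_min, warmup_iters, max_iters) with three regimes:
- step < warmup_iters: lr = lr_max * step / warmup_iters (linear warmup)
- warmup_iters <= step <= max_iters: lr = lr_min + 0.5 * (lr_max - lr_min) * (1 + cos(pi * t)), where t = (step - warmup_iters) / (max_iters - warmup_iters) (cosine decay)
- step > max_iters: lr = lr_min (flat tail)
At step = warmup_iters both branches equal lr_max; at step = max_iters the cosine term gives cos(pi) = -1, so the LR is exactly lr_min.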
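The three regimes can be sketched as below. This is one possible reading of the problem statement, not a reference solution: the branch order and the choice to return lr_min as soon as step reaches max_iters (which also sidesteps a division by zero when warmup_iters equals max_iters) are assumptions.

```python
import math


def lr_at(step, lr_max, lr_min, warmup_iters, max_iters):
    """Learning rate at `step`: linear warmup, cosine decay, flat tail."""
    # Regime 1: linear ramp from 0 towards lr_max.
    if step < warmup_iters:
        return lr_max * step / warmup_iters
    # Regime 3: flat tail at the floor from max_iters onwards.
    if step >= max_iters:
        return lr_min
    # Regime 2: cosine decay; t runs from 0 to 1 across the decay window.
    t = (step - warmup_iters) / (max_iters - warmup_iters)
    coeff = 0.5 * (1.0 + math.cos(math.pi * t))  # 1 at t = 0, 0 at t = 1
    return lr_min + coeff * (lr_max - lr_min)


# Boundary points from Example 1 (lr_max=3e-4, lr_min=3e-5, warmup=100, max=1000).
for s in (0, 100, 1000, 5000):
    print(s, lr_at(s, 3e-4, 3e-5, 100, 1000))
```

Because the arithmetic is floating point, the value at step == warmup_iters may differ from lr_max in the last bit, so comparisons are safer with math.isclose than with ==.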
Input
- step (int): the current training step.
- lr_max: peak LR, reached at step == warmup_iters.
- lr_min: floor LR, reached at step == max_iters.
- warmup_iters (int): duration of the linear warmup.
- max_iters (int): total training duration; past this step the function returns lr_min.
Output
Returns a float: the learning rate at this step.
Examples
Example 1 — the boundary points (lr_max=3e-4, lr_min=3e-5, warmup=100, max=1000)
step = 0 -> 0.0 # start of warmup
step = 100 -> 3e-4 # end of warmup == lr_max
step = 1000 -> 3e-5 # end of decay == lr_min
step = 5000 -> 3e-5 # flat tail past max_iters
Explanation: a linear ramp from 0 to lr_max over [0, warmup), a cosine decay from lr_max to lr_min over [warmup, max), then a constant lr_min.
Example 2 — the smooth midpoints
step = 50 (warmup/2) -> 0.5 * lr_max (linear half-way)
step = 550 (warmup + (max-warmup)/2) -> (lr_max + lr_min)/2 (cos(pi/2)=0)
Explanation: halfway through warmup the linear ramp is at half of lr_max; halfway through decay the cosine argument is pi/2, so cos(pi/2) = 0 and the LR sits exactly between lr_max and lr_min.
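The two midpoints in Example 2 can be checked by hand. The sketch below evaluates the warmup and decay expressions directly with the Example 1 values; the variable names are illustrative only.

```python
import math

lr_max, lr_min = 3e-4, 3e-5
warmup_iters, max_iters = 100, 1000

# Halfway through warmup: the ramp fraction is 50 / 100 = 0.5.
warmup_mid = lr_max * 50 / warmup_iters

# Halfway through decay: t = (550 - 100) / (1000 - 100) = 0.5.
t = (550 - warmup_iters) / (max_iters - warmup_iters)
cos_term = math.cos(math.pi * t)  # cos(pi/2): zero up to rounding error
decay_mid = lr_min + 0.5 * (lr_max - lr_min) * (1 + cos_term)

print(warmup_mid)  # about 1.5e-4
print(decay_mid)   # about 1.65e-4
```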
Constraints
- Three branches: linear warmup, cosine decay, then flat at lr_min.
- The decay progress (step - warmup_iters) / (max_iters - warmup_iters) runs from 0 to 1, so 0 gives lr_max and 1 gives lr_min.
- The schedule is continuous at both boundaries (no jump at step == warmup, no cliff at step == max).
- Pure function: same inputs always give the same output, with no side effects.
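The continuity constraint can be checked numerically by evaluating the adjacent branches at each boundary and comparing them. The helper names warmup_branch and cosine_branch below are hypothetical, introduced only for this check, and the parameter values are those of Example 1.

```python
import math

lr_max, lr_min, warmup_iters, max_iters = 3e-4, 3e-5, 100, 1000


def warmup_branch(step):
    # Linear ramp, valid for step <= warmup_iters.
    return lr_max * step / warmup_iters


def cosine_branch(step):
    # Cosine decay, valid for warmup_iters <= step <= max_iters.
    t = (step - warmup_iters) / (max_iters - warmup_iters)
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * t))


# Warmup meets decay: both branches should give lr_max at step == warmup_iters.
gap_at_warmup = abs(warmup_branch(warmup_iters) - cosine_branch(warmup_iters))
# Decay meets the flat tail: the cosine branch should give lr_min at step == max_iters.
gap_at_max = abs(cosine_branch(max_iters) - lr_min)

print(gap_at_warmup, gap_at_max)  # both should be zero up to rounding error
```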
Notes
- Why warmup. Adam's second-moment estimate is noisy in the first few hundred steps; hitting fragile freshly-initialised weights with lr_max can blow up training. Warmup ramps the LR while the optimiser stabilises.
- Why cosine + flat tail. The cosine has zero slope at both ends of the decay, so the LR eases away from lr_max and settles onto the floor without a sudden change; the flat lr_min tail lets you train past max_iters without an abrupt LR cliff.
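The smoothness claim for the cosine segment can be illustrated with its derivative. The sketch below assumes the decay has the form lr_min + 0.5 * (lr_max - lr_min) * (1 + cos(pi * t)) with t in [0, 1]; decay_slope is a hypothetical helper giving d(lr)/dt, and the values are those of Example 1.

```python
import math

lr_max, lr_min = 3e-4, 3e-5


def decay_slope(t):
    """Derivative with respect to t of the cosine decay at progress t."""
    return -0.5 * (lr_max - lr_min) * math.pi * math.sin(math.pi * t)


# Slope at the start, middle, and end of the decay window.
for t in (0.0, 0.5, 1.0):
    print(t, decay_slope(t))
```

The slope is zero (up to rounding) at t = 0 and t = 1 and steepest at t = 0.5, so the decay joins the flat tail with matching slope. The warmup-to-decay boundary is continuous in value only: the ramp arrives with a positive slope while the cosine leaves with zero slope.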
This problem ships 6 hidden tests, which run in the browser via Pyodide:
- Step 0 (start of warmup): lr is exactly 0
- End of warmup: lr is exactly lr_max
- End of training: lr is exactly lr_min
- Past max_iters: lr stays at lr_min (flat tail)
- Diagnostic: midway through cosine decay, lr is at the half-amplitude point
- Linear warmup is, well, linear