Adam optimiser step
Background
Adam (Adaptive Moment Estimation) is the default optimiser for training deep networks and transformers. It keeps a per-parameter running estimate of the gradient's first moment (the mean, $m$) and second moment (the uncentered variance, $v$), then scales each parameter's step by those estimates, so parameters with consistently large or noisy gradients get smaller, better-conditioned updates. It is the optimiser behind most LLM pre-training; AdamW, its weight-decay variant, differs by a single term. Knowing it cold is table stakes for an ML interview.
Problem statement
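Implement adam_step(params, grads, m, v, t, lr, beta1, beta2, eps) that applies one Adam update, in place, to every parameter tensor. For each parameter $\theta$ with gradient $g$ and moment buffers $m$, $v$:
$$m \leftarrow \beta_1 m + (1 - \beta_1)\, g$$
$$v \leftarrow \beta_2 v + (1 - \beta_2)\, g^2$$
$$\hat{m} = \frac{m}{1 - \beta_1^t}, \qquad \hat{v} = \frac{v}{1 - \beta_2^t}$$
$$\theta \leftarrow \theta - \mathrm{lr} \cdot \frac{\hat{m}}{\sqrt{\hat{v}} + \epsilon}$$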
Defaults from the original paper: lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8.
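A minimal NumPy sketch of one such update may help fix the order of operations. It is one possible solution, not the reference one; it assumes every array has a float dtype (in-place arithmetic on integer arrays would fail), and the local names bc1, bc2, m_hat and v_hat are illustrative.

```python
import numpy as np

def adam_step(params, grads, m, v, t, lr, beta1, beta2, eps):
    """Apply one Adam update in place to every array in params, m and v."""
    # Bias-correction denominators depend only on the step count t (t >= 1).
    bc1 = 1.0 - beta1 ** t
    bc2 = 1.0 - beta2 ** t
    for p, g, m_i, v_i in zip(params, grads, m, v):
        # In-place moment updates, so the caller's buffers keep the new state.
        m_i *= beta1
        m_i += (1.0 - beta1) * g
        v_i *= beta2
        v_i += (1.0 - beta2) * g * g
        # Bias-corrected estimates are temporaries; eps sits outside the sqrt.
        m_hat = m_i / bc1
        v_hat = v_i / bc2
        p -= lr * m_hat / (np.sqrt(v_hat) + eps)

# One step from zero state with the paper's default hyperparameters.
params = [np.array([1.0])]
grads = [np.array([0.1])]
m = [np.zeros(1)]
v = [np.zeros(1)]
adam_step(params, grads, m, v, t=1, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8)
print(m[0], v[0], params[0])
```

Keeping m_hat and v_hat as temporaries matters: only the raw moments m and v are stored between steps, and the bias correction is recomputed from t on every call.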
Input
- params: list[np.ndarray], the model parameters. Mutated in place.
- grads: list[np.ndarray], same shapes as params; the gradient for each.
- m: list[np.ndarray], first-moment buffers (start at 0). Mutated in place.
- v: list[np.ndarray], second-moment buffers (start at 0). Mutated in place.
- t: int, $t \ge 1$, the current step (1-indexed). Drives bias correction.
- lr, beta1, beta2, eps: scalars, namely the learning rate, the first- and second-moment decay rates, and the stabiliser $\epsilon$.
Output
Returns None. Updates params, m, and v in place, so the caller's references see the new values after the call.
Examples
Example 1 — first step from zero state
Input: params=[1.0], grads=[0.1], m=[0.0], v=[0.0],
t=1, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8
Output: m=[0.01], v=[1e-5], params≈[0.999]
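Explanation: $m = 0.9 \cdot 0 + 0.1 \cdot 0.1 = 0.01$, so $\hat{m} = 0.01 / (1 - 0.9) = 0.1$. $v = 0.999 \cdot 0 + 0.001 \cdot 0.1^2 = 10^{-5}$, so $\hat{v} = 10^{-5} / (1 - 0.999) = 0.01$. Then $\mathrm{lr} \cdot \hat{m} / (\sqrt{\hat{v}} + \epsilon) = 10^{-3} \cdot 0.1 / (0.1 + 10^{-8}) \approx 10^{-3}$, giving $\theta \approx 1.0 - 0.001 = 0.999$.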
Example 2 — bias correction at $t = 1$
Input: params=[10.0], grads=[2.0], m=[0.0], v=[0.0],
t=1, lr=1.0, beta1=0.9, beta2=0.999, eps=1e-12
Output: params≈[9.0]
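Explanation: at $t = 1$ the denominator $1 - \beta^1$ exactly cancels the $(1 - \beta)$ factor in the numerator, so $\hat{m} = g = 2$ and $\hat{v} = g^2 = 4$. The step is $\mathrm{lr} \cdot \hat{m} / (\sqrt{\hat{v}} + \epsilon) = 1 \cdot 2 / (2 + 10^{-12}) \approx 1$, so $\theta \approx 10 - 1 = 9$. The first step is $\approx \mathrm{lr} \cdot \mathrm{sign}(g)$ regardless of $|g|$; this is what keeps Adam from stalling at step 1.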
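The cancellation at the first step can be checked numerically. The sketch below uses a hypothetical scalar helper, first_step, which is not part of the problem's API; it starts from zero moments and reuses this example's hyperparameters as defaults.

```python
import math

def first_step(g, lr=1.0, beta1=0.9, beta2=0.999, eps=1e-12):
    """Signed size of the Adam update at t = 1, starting from m = v = 0."""
    m = (1.0 - beta1) * g
    v = (1.0 - beta2) * g * g
    m_hat = m / (1.0 - beta1)  # equals g up to rounding
    v_hat = v / (1.0 - beta2)  # equals g**2 up to rounding
    return lr * m_hat / (math.sqrt(v_hat) + eps)

# Gradients spanning five orders of magnitude, one of them negative.
for g in (2.0, 0.002, -200.0):
    print(g, first_step(g))
```

Each printed step should be very close to lr in magnitude and carry the sign of g, as long as |g| is much larger than eps.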
Constraints
- Within each tuple, params[i], grads[i], m[i], v[i] share the same shape.
- $t \ge 1$ and is the true step count; bias correction divides by $1 - \beta_1^t$ and $1 - \beta_2^t$.
- Update in place (*=, +=, -=). Rebinding m/v discards the optimiser state across steps.
- $\epsilon$ is added after the square root: $\sqrt{\hat{v}} + \epsilon$.
- Tests compare with tolerance atol ≈ 1e-6.
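The in-place constraint is the easiest one to break by accident. The following sketch contrasts the two patterns on the first-moment buffer only; both function names are hypothetical and exist purely for illustration.

```python
import numpy as np

def update_m_rebinding(m, grads, beta1):
    # Rebinds the local name m to a brand-new list; the caller's list
    # and the arrays inside it are left untouched.
    m = [beta1 * m_i + (1.0 - beta1) * g for m_i, g in zip(m, grads)]

def update_m_in_place(m, grads, beta1):
    # Mutates each existing array, so the caller's references see the change.
    for m_i, g in zip(m, grads):
        m_i *= beta1
        m_i += (1.0 - beta1) * g

grads = [np.array([0.1, -0.2])]
m_lost = [np.zeros(2)]
m_kept = [np.zeros(2)]
update_m_rebinding(m_lost, grads, beta1=0.9)
update_m_in_place(m_kept, grads, beta1=0.9)
print(m_lost[0])  # still all zeros: the update was discarded
print(m_kept[0])  # holds the new first-moment estimate
```

Assigning m[i] = new_array inside the loop would update the caller's list but not any outside reference to the old array, so augmented assignment on the array itself is the safer habit.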
Notes
- Bias correction. m and v are initialised to 0, biasing them toward 0 early on; dividing by $1 - \beta^t$ corrects this, and the correction vanishes as $t$ grows.
- State persistence. m and v are optimiser state that lives across steps; the in-place requirement is what lets a training loop reuse them.
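To see how quickly the bias correction fades, one can tabulate the factor $1 / (1 - \beta^t)$ that multiplies each raw moment. The helper name correction is an assumption made for this sketch; the decay rates are the paper defaults quoted earlier.

```python
beta1, beta2 = 0.9, 0.999

def correction(beta, t):
    """Factor applied to the raw moment at step t (t >= 1)."""
    return 1.0 / (1.0 - beta ** t)

# The factor starts at 1 / (1 - beta) and decays toward 1 as t grows.
for t in (1, 10, 100, 1000, 10000):
    print(t, correction(beta1, t), correction(beta2, t))
```

The second-moment factor takes far longer to approach 1 because beta2 is much closer to 1, which is why passing the true step count t matters most for v early in training.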
This problem ships 6 hidden tests. They run in your browser via Pyodide, with no backend and no submission queue.
- First step from zero state matches the bias-corrected formula
- Diagnostic: at step 1, m_hat = g and v_hat = g^2 (bias correction effect)
- Constant gradient over many steps: parameter drifts steadily
- Multiple parameter tensors all update in their own m and v buffers
- Mutation is in place: caller's references see the new values
- eps prevents division-by-zero when v_hat is exactly 0
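The last case is worth seeing once in isolation. This sketch assumes a parameter whose gradient has been exactly zero so far, so both bias-corrected moments are zero; the variable names are illustrative.

```python
import numpy as np

m_hat = np.zeros(3)  # bias-corrected first moment, exactly 0
v_hat = np.zeros(3)  # bias-corrected second moment, exactly 0
lr, eps = 1e-3, 1e-8

# With eps outside the square root the update is 0 / eps = 0.
with_eps = lr * m_hat / (np.sqrt(v_hat) + eps)

# Without eps the update is 0 / 0, which NumPy evaluates to nan.
with np.errstate(divide="ignore", invalid="ignore"):
    without_eps = lr * m_hat / np.sqrt(v_hat)

print(with_eps, without_eps)
```

A single nan written into a parameter would propagate through every later forward pass, so the stabiliser is doing real work even though it is tiny.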