SGD with momentum (one step)
Background
SGD with momentum is the simplest optimiser upgrade after vanilla SGD, and still the dominant choice for ConvNets. The "heavy-ball" intuition: instead of stepping straight down the current gradient, accumulate a velocity that smooths over recent gradients — steady descents speed up, while stochastic noise averages out. Adam and AdamW build directly on this idea.
Problem statement
Implement sgd_momentum_step(params, grads, velocities, lr, momentum), mutating both params and velocities in place. For each parameter , gradient , velocity (PyTorch convention — no learning rate inside the velocity):
where momentum. Return None.
Input
params—list[np.ndarray]: the parameters. Mutated in place.grads—list[np.ndarray]: the gradients, same shapes.velocities—list[np.ndarray]: the momentum buffers, same shapes. Mutated in place.lr— scalar learning rate.momentum— scalar in (typically 0.9).
Output
Returns None. Updates params and velocities in place.
Examples
Example 1 — the first step from zero velocity is plain SGD
Input: params=[[1,2,3]], grads=[[0.1,0.2,0.3]], velocities=[[0,0,0]], lr=0.5, momentum=0.9
Output: velocities -> [[0.1, 0.2, 0.3]], params -> [[0.95, 1.9, 2.85]]
Explanation: from zero velocity, , then . Identical to vanilla SGD — momentum only kicks in once velocity accumulates over steps.
Example 2 — velocity converges for a constant gradient
Input: constant g=1, momentum=0.9, lr=0 (freeze params), many steps
Output: velocity -> 1/(1 - 0.9) = 10
Explanation: with a constant gradient the update is a geometric series converging to ; for that's . This accumulation is the "speed-up" momentum provides on steady descents.
Constraints
- Update in place:
v *= momentum; v += g; p -= lr * v. Writingv = momentum*v + grebinds the local name and leaves the caller's buffer untouched. - Velocity buffers persist across steps — reusing them is the whole point; fresh arrays each step would degrade to memoryless SGD.
momentum = 0reduces to a plain SGD step ().- For a constant gradient, the velocity limit is .
Notes
- Convention. This is PyTorch's form (learning rate applied outside the velocity). Sutskever's older form folds
lrintov; the two are equivalent up to scaling, but the PyTorch form is the common one today. - Tuning.
lr=0.01, momentum=0.9is the classic ResNet-style setting. Transformers usually prefer AdamW, but every momentum-based optimiser starts from this two-line update — cf. plain SGD step.
This problem ships 5 hidden tests. They run in your browser via Pyodide — no backend, no submission queue. Press ▶ Run tests to execute.
- •Single step from zero velocity is equivalent to plain SGD
- •After several steps with constant gradient, velocity approaches g/(1-momentum)
- •Diagnostic: with momentum=0, equivalent to plain SGD step
- •Multiple parameter tensors all update
- •Mutation is in place — caller's references see the new values