SGD with momentum (one step)Easy

SGD with momentum (one step)

Background

SGD with momentum is the simplest optimiser upgrade after vanilla SGD, and still the dominant choice for ConvNets. The "heavy-ball" intuition: instead of stepping straight down the current gradient, accumulate a velocity that smooths over recent gradients — steady descents speed up, while stochastic noise averages out. Adam and AdamW build directly on this idea.

Problem statement

Implement sgd_momentum_step(params, grads, velocities, lr, momentum), mutating both params and velocities in place. For each parameter pp, gradient gg, velocity vv (PyTorch convention — no learning rate inside the velocity):

vμv+g,pplrvv \leftarrow \mu\, v + g, \qquad p \leftarrow p - \mathrm{lr}\cdot v

where μ=\mu = momentum. Return None.

Input

  • paramslist[np.ndarray]: the parameters. Mutated in place.
  • gradslist[np.ndarray]: the gradients, same shapes.
  • velocitieslist[np.ndarray]: the momentum buffers, same shapes. Mutated in place.
  • lr — scalar learning rate.
  • momentum — scalar in [0,1)[0, 1) (typically 0.9).

Output

Returns None. Updates params and velocities in place.

Examples

Example 1 — the first step from zero velocity is plain SGD

Input:  params=[[1,2,3]], grads=[[0.1,0.2,0.3]], velocities=[[0,0,0]], lr=0.5, momentum=0.9
Output: velocities -> [[0.1, 0.2, 0.3]],  params -> [[0.95, 1.9, 2.85]]

Explanation: from zero velocity, v=0.90+g=gv = 0.9\cdot 0 + g = g, then p=0.5gp \mathrel{-}= 0.5\,g. Identical to vanilla SGD — momentum only kicks in once velocity accumulates over steps.

Example 2 — velocity converges for a constant gradient

Input:  constant g=1, momentum=0.9, lr=0 (freeze params), many steps
Output: velocity -> 1/(1 - 0.9) = 10

Explanation: with a constant gradient the update vμv+gv \leftarrow \mu v + g is a geometric series converging to g/(1μ)g/(1-\mu); for μ=0.9\mu=0.9 that's 10g10g. This accumulation is the "speed-up" momentum provides on steady descents.

Constraints

  • Update in place: v *= momentum; v += g; p -= lr * v. Writing v = momentum*v + g rebinds the local name and leaves the caller's buffer untouched.
  • Velocity buffers persist across steps — reusing them is the whole point; fresh arrays each step would degrade to memoryless SGD.
  • momentum = 0 reduces to a plain SGD step (v=gv = g).
  • For a constant gradient, the velocity limit is g/(1μ)g/(1-\mu).

Notes

  • Convention. This is PyTorch's form (learning rate applied outside the velocity). Sutskever's older form folds lr into v; the two are equivalent up to scaling, but the PyTorch form is the common one today.
  • Tuning. lr=0.01, momentum=0.9 is the classic ResNet-style setting. Transformers usually prefer AdamW, but every momentum-based optimiser starts from this two-line update — cf. plain SGD step.
Python
Loading...

This problem ships 5 hidden tests. They run in your browser via Pyodide — no backend, no submission queue. Press ▶ Run tests to execute.

  • Single step from zero velocity is equivalent to plain SGD
  • After several steps with constant gradient, velocity approaches g/(1-momentum)
  • Diagnostic: with momentum=0, equivalent to plain SGD step
  • Multiple parameter tensors all update
  • Mutation is in place — caller's references see the new values