Adamax optimizer

Difficulty: Medium

Background

Adamax is a variant of Adam in which the second-moment EMA is replaced by an infinity norm of past gradients. Instead of an exponential average of squared gradients, it keeps a running maximum of (decayed) gradient magnitudes. This makes the per-parameter scale more stable and bounded, and the update rule simpler.

Problem statement

Implement adamax_optimizer(parameter, grad, m, u, t, learning_rate=0.002, beta1=0.9, beta2=0.999, epsilon=1e-8) for one step:

$$m \leftarrow \beta_1 m + (1-\beta_1)\,g, \qquad u \leftarrow \max(\beta_2\, u,\ |g|)$$

$$\hat m = \frac{m}{1-\beta_1^{\,t}}, \qquad \theta \leftarrow \theta - \frac{\eta\, \hat m}{u + \epsilon}$$

Return the updated parameter, $m$, and $u$, each rounded to 5 decimals.

Input

  • parameter, grad: current value(s) and gradient.
  • m: first-moment estimate (starts at 0).
  • u: infinity-norm estimate (starts at 0, stays non-negative).
  • t: current timestep (int, $t \ge 1$), used for bias correction.
  • learning_rate: float, the step size $\eta$ (default 0.002).
  • beta1, beta2: float, the moment decay rates (defaults 0.9 and 0.999).
  • epsilon: float, small constant added to $u$ in the denominator (default 1e-8).

Output

Returns (updated_parameter, updated_m, updated_u), each rounded to 5 decimals.

Examples

Example 1

Input:  parameter = 1.0, grad = 0.1, m = 0.0, u = 0.0, t = 1
        (learning_rate = 0.002, beta1 = 0.9, beta2 = 0.999)
Output: (0.998, 0.01, 0.1)

Explanation: $m = 0.9(0) + 0.1(0.1) = 0.01$; $u = \max(0.999\cdot 0,\ |0.1|) = 0.1$; $\hat m = 0.01/(1-0.9) = 0.1$; the step is $0.002\cdot 0.1/(0.1+\epsilon) \approx 0.002$, so $\theta = 1.0 - 0.002 = 0.998$ after rounding.
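
The arithmetic in the explanation can be re-traced with plain Python floats. This is only a sketch of Example 1; the variable names (theta, lr, eps, m_hat, step) are chosen here for illustration and are not part of the problem's interface.

```python
# Example 1, step by step, using plain Python floats.
theta, g, m, u, t = 1.0, 0.1, 0.0, 0.0, 1
lr, beta1, beta2, eps = 0.002, 0.9, 0.999, 1e-8

m = beta1 * m + (1 - beta1) * g   # first moment, about 0.01
u = max(beta2 * u, abs(g))        # running max of decayed magnitudes, 0.1
m_hat = m / (1 - beta1 ** t)      # bias-corrected first moment, about 0.1
step = lr * m_hat / (u + eps)     # slightly below 0.002 because of eps
theta = theta - step

result = (round(theta, 5), round(m, 5), round(u, 5))
print(result)  # (0.998, 0.01, 0.1)
```

Because of epsilon the step is marginally smaller than 0.002, but the difference disappears once the result is rounded to 5 decimals.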

Constraints

  • The infinity-norm uses np.maximum(beta2*u, abs(grad)) — a max, not an EMA of squares.
  • Only the first moment $m$ is bias-corrected (divided by $1-\beta_1^{\,t}$); $u$ is not.
  • Round all three outputs to 5 decimals.
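
Putting the update rule and the constraints above together, one possible implementation looks like the following. It is a minimal sketch rather than the reference solution: it assumes NumPy is available and converts every input with np.asarray so that scalars and arrays share one code path, which is a design choice made here and not something the problem requires.

```python
import numpy as np

def adamax_optimizer(parameter, grad, m, u, t,
                     learning_rate=0.002, beta1=0.9, beta2=0.999,
                     epsilon=1e-8):
    """One Adamax step; returns (parameter, m, u), each rounded to 5 decimals."""
    # Assumption of this sketch: treat scalars and arrays uniformly as float arrays.
    parameter = np.asarray(parameter, dtype=float)
    grad = np.asarray(grad, dtype=float)
    m = np.asarray(m, dtype=float)
    u = np.asarray(u, dtype=float)

    # First moment: exponential moving average of the gradient.
    m = beta1 * m + (1 - beta1) * grad
    # Infinity norm: running max of decayed magnitudes (not an EMA of squares).
    u = np.maximum(beta2 * u, np.abs(grad))
    # Only m is bias-corrected.
    m_hat = m / (1 - beta1 ** t)
    parameter = parameter - learning_rate * m_hat / (u + epsilon)

    return np.round(parameter, 5), np.round(m, 5), np.round(u, 5)

p, new_m, new_u = adamax_optimizer(1.0, 0.1, 0.0, 0.0, 1)
print(float(p), float(new_m), float(new_u))  # 0.998 0.01 0.1
```

The call at the bottom reproduces Example 1. Rounding is applied only to the returned values, so the update itself is computed at full precision.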

Notes

  • Because $u$ tracks a maximum, the effective step size is roughly bounded by $\eta\,(1-\beta_1^{\,t})^{-1}$, so Adamax is robust to occasional huge gradients.
  • It needs no bias correction on $u$: the max of decayed magnitudes is already a stable scale estimate.
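
The robustness claim in these notes can be checked numerically. The sketch below feeds a made-up gradient stream containing a single spike of 1000 through the update and compares each step with $\eta/(1-\beta_1^{\,t})$. The stream and the variable names are hypothetical, chosen only for illustration.

```python
lr, beta1, beta2, eps = 0.002, 0.9, 0.999, 1e-8

# Hypothetical gradient stream: small gradients with one huge spike.
grads = [0.1] * 5 + [1000.0] + [0.1] * 5

m, u = 0.0, 0.0
steps, bounds = [], []
for t, g in enumerate(grads, start=1):
    m = beta1 * m + (1 - beta1) * g
    u = max(beta2 * u, abs(g))
    m_hat = m / (1 - beta1 ** t)
    steps.append(lr * m_hat / (u + eps))
    bounds.append(lr / (1 - beta1 ** t))

# Every step stays within the bound, including the one at the spike.
within_bound = all(abs(s) <= b for s, b in zip(steps, bounds))
largest_ratio = max(abs(s) / lr for s in steps)
print(within_bound, largest_ratio)
```

For this particular stream the largest step never exceeds the learning rate itself: the spike raises $u$ at the same moment it raises $m$, so the ratio $\hat m / u$ stays small.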

This problem ships 4 hidden tests, which run in the browser via Pyodide:

  • Reference example
  • u is the infinity norm (a running max)
  • m is bias-corrected by 1 - beta1**t
  • Works elementwise on arrays
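
The last item, elementwise behaviour on arrays, can be illustrated with a single step written directly in NumPy. The inputs below are invented for this sketch and are not the hidden test data. They are picked so that the two gradients differ by a factor of 200.

```python
import numpy as np

lr, beta1, beta2, eps, t = 0.002, 0.9, 0.999, 1e-8, 1

# Hypothetical inputs: two parameters with gradients of very different scale.
theta = np.array([1.0, 2.0])
grad = np.array([0.1, -20.0])
m = np.zeros(2)
u = np.zeros(2)

m = beta1 * m + (1 - beta1) * grad
u = np.maximum(beta2 * u, np.abs(grad))   # one running max per element
m_hat = m / (1 - beta1 ** t)
theta = theta - lr * m_hat / (u + eps)

theta, m, u = np.round(theta, 5), np.round(m, 5), np.round(u, 5)
print(theta.tolist(), m.tolist(), u.tolist())
```

Each element is scaled by its own entry of $u$, so on this first step both parameters move by about 0.002 in magnitude (to 0.998 and 2.002) despite the difference in gradient size.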