Adamax optimizer
Background
Adamax is a variant of Adam in which the second-moment EMA is replaced by an infinity norm of past gradients. Instead of an exponential average of squared gradients, it keeps a running maximum of (decayed) gradient magnitudes. This makes the per-parameter scale more stable and bounded, and the update rule simpler.
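To make the contrast concrete, here is a minimal sketch that tracks both scale estimates over a short gradient sequence. The gradient values and the names `v` (Adam-style EMA of squares) and `u` (Adamax-style running max) are illustrative assumptions, not part of the problem.

```python
import math

beta2 = 0.999
grads = [0.1, 0.1, 5.0, 0.1, 0.1]  # hypothetical gradients with one spike

v = 0.0  # Adam: exponential average of squared gradients
u = 0.0  # Adamax: decayed running maximum of |grad|
for g in grads:
    v = beta2 * v + (1 - beta2) * g ** 2
    u = max(beta2 * u, abs(g))

adam_scale = math.sqrt(v)  # Adam's scale, before its own bias correction
print("Adam scale:", adam_scale)
print("Adamax scale:", u)
```

After the spike, `u` only shrinks by a factor of `beta2` per step until a larger gradient magnitude replaces it, so it stays close to the largest recent magnitude.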
Problem statement
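Implement adamax_optimizer(parameter, grad, m, u, t, learning_rate=0.002, beta1=0.9, beta2=0.999, epsilon=1e-8) for one step. With $\theta$ = parameter, $g$ = grad, and $\eta$ = learning_rate:
$$m_t = \beta_1 m_{t-1} + (1 - \beta_1)\, g$$
$$u_t = \max(\beta_2 u_{t-1},\ |g|)$$
$$\theta_t = \theta_{t-1} - \frac{\eta}{1 - \beta_1^t} \cdot \frac{m_t}{u_t + \epsilon}$$
Here $\epsilon$ is assumed to be added to $u_t$ to guard the division. Return the updated parameter, $m_t$, and $u_t$, each rounded to 5 decimals.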
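One possible implementation of the step described above is sketched below. It assumes that epsilon is added to u in the denominator (the usual convention, though the statement does not spell it out) and that inputs may be scalars or NumPy arrays. It is a sketch under those assumptions, not the reference solution.

```python
import numpy as np

def adamax_optimizer(parameter, grad, m, u, t,
                     learning_rate=0.002, beta1=0.9, beta2=0.999, epsilon=1e-8):
    # Convert to float arrays so scalars and arrays share one code path.
    parameter = np.asarray(parameter, dtype=float)
    grad = np.asarray(grad, dtype=float)
    m = np.asarray(m, dtype=float)
    u = np.asarray(u, dtype=float)

    m = beta1 * m + (1 - beta1) * grad        # first-moment EMA
    u = np.maximum(beta2 * u, np.abs(grad))   # infinity norm: decayed running max
    m_hat = m / (1 - beta1 ** t)              # bias-correct the first moment only
    # Assumption: epsilon is added to u to avoid division by zero.
    parameter = parameter - learning_rate * m_hat / (u + epsilon)

    return np.round(parameter, 5), np.round(m, 5), np.round(u, 5)

new_p, new_m, new_u = adamax_optimizer(1.0, 0.1, 0.0, 0.0, 1)
print(new_p, new_m, new_u)
```

Because every operation here is a NumPy elementwise function, the same call should also accept arrays for parameter, grad, m, and u, provided their shapes are compatible.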
Input
- parameter, grad: current value(s) and gradient.
- m: first-moment estimate (starts at 0).
- u: infinity-norm estimate (starts at 0, stays non-negative).
- t: current timestep (int, t ≥ 1), used for bias correction.
- learning_rate: float (default 0.002).
- beta1, beta2: float, the moment decay rates.
- epsilon: float.
Output
Returns (updated_parameter, updated_m, updated_u), each rounded to 5 decimals.
Examples
Example 1
Input: parameter = 1.0, grad = 0.1, m = 0.0, u = 0.0, t = 1
(learning_rate = 0.002, beta1 = 0.9, beta2 = 0.999)
Output: (0.998, 0.01, 0.1)
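Explanation: $m = 0.9 \cdot 0 + 0.1 \cdot 0.1 = 0.01$; $u = \max(0.999 \cdot 0,\ |0.1|) = 0.1$; $\hat{m} = 0.01 / (1 - 0.9^1) = 0.1$; step $0.002 \cdot 0.1 / (0.1 + 10^{-8}) \approx 0.002$, so the updated parameter is $1.0 - 0.002 = 0.998$.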
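The arithmetic of Example 1 can be checked with plain floats. The sketch below also feeds the state into a second step at t = 2 with the same gradient. That second step is an illustrative extension, not part of the original example, and the placement of epsilon (added to u) is an assumption.

```python
learning_rate, beta1, beta2, epsilon = 0.002, 0.9, 0.999, 1e-8
parameter, grad, m, u = 1.0, 0.1, 0.0, 0.0

history = []
for t in (1, 2):
    m = beta1 * m + (1 - beta1) * grad            # first-moment EMA
    u = max(beta2 * u, abs(grad))                 # running max of |grad|
    m_hat = m / (1 - beta1 ** t)                  # bias-corrected first moment
    # Assumption: epsilon is added to u in the denominator.
    parameter -= learning_rate * m_hat / (u + epsilon)
    history.append((round(parameter, 5), round(m, 5), round(u, 5)))

print(history[0])  # step t = 1: (0.998, 0.01, 0.1), matching Example 1
print(history[1])  # step t = 2, the illustrative continuation
```

With a constant gradient the bias-corrected first moment equals the gradient at every step, so each step moves the parameter by almost exactly learning_rate.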
Constraints
- The infinity norm uses np.maximum(beta2*u, abs(grad)): a max, not an EMA of squares.
- Only the first moment is bias-corrected (by 1 - beta1**t); u is not.
- Round all three outputs to 5 decimals.
Notes
- Because u tracks a maximum, the effective step size is bounded by roughly learning_rate, so Adamax is robust to occasional huge gradients.
- It needs no bias correction on u: the max of decayed magnitudes is already a stable scale estimate.
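The robustness claim can be probed with a small experiment. The sketch below runs the update on a hypothetical gradient stream containing one spike of 1000 and records the step taken at each timestep. The gradient values are made up for illustration, and epsilon is again assumed to be added to u.

```python
learning_rate, beta1, beta2, epsilon = 0.002, 0.9, 0.999, 1e-8

# Hypothetical gradient stream: small values with one huge spike.
grads = [0.1] * 5 + [1000.0] + [0.1] * 5

m, u = 0.0, 0.0
steps = []
for t, g in enumerate(grads, start=1):
    m = beta1 * m + (1 - beta1) * g
    u = max(beta2 * u, abs(g))
    m_hat = m / (1 - beta1 ** t)
    steps.append(learning_rate * m_hat / (u + epsilon))

largest = max(abs(s) for s in steps)
print(f"largest Adamax step: {largest:.6f}")
print(f"plain gradient step at the spike: {learning_rate * 1000.0:.1f}")
```

For this sequence every Adamax step stays at or below learning_rate, whereas an unscaled gradient step at the spike would move the parameter by 2.0. The flip side is that u stays large after the spike, so the following steps are much smaller until u decays.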
This problem ships 4 hidden tests, which run in the browser via Pyodide:
- Reference example
- u is the infinity norm (a running max)
- m is bias-corrected by 1 - beta1**t
- Works elementwise on arrays