Adagrad optimizer
Background
Adagrad gives every parameter its own learning rate by dividing the global rate by the square root of the sum of all past squared gradients. Parameters with large/frequent gradients take smaller steps; rare-feature parameters take larger steps — handy for sparse data. Its weakness: the accumulator only grows, so the effective rate eventually shrinks toward zero (which RMSprop/Adam fix).
Problem statement
Implement adagrad_optimizer(parameter, grad, G, learning_rate=0.01, epsilon=1e-8) for one update step:
Return the updated parameter and accumulator, both rounded to 5 decimals.
Input
parameter— current parameter value(s) (scalar ornp.ndarray).grad— current gradient , same shape.G— accumulated squared gradients so far (same shape; starts at 0).learning_rate—float, the global rate .epsilon—float.
Output
Returns a tuple (updated_parameter, updated_G), each rounded to 5 decimals.
Examples
Example 1
Input: parameter = 1.0, grad = 0.1, G = 1.0
Output: (0.999, 1.01)
Explanation: ; the step is , so .
Constraints
- Update before computing the step (accumulate ).
- Step , subtracted from the parameter.
- Round both outputs to 5 decimals.
Notes
- is the sum (not a running average) of squared gradients, so it never decreases — the per-parameter rate is monotonically non-increasing.
- That monotonic decay is Adagrad's flaw on long runs; RMSprop replaces the sum with an exponential moving average to keep learning.
This problem ships 4 hidden tests. They run in your browser via Pyodide — no backend, no submission queue. Press ▶ Run tests to execute.
- •Reference example
- •Accumulator adds grad squared
- •Larger accumulated G yields a smaller step
- •Works elementwise on arrays