UCB1 multi-armed bandit
Background
The multi-armed bandit problem asks: given several actions ("arms") with unknown reward distributions, how do you balance exploration (trying uncertain arms) against exploitation (pulling the current best)? UCB1 (Upper Confidence Bound) gives each arm an optimistic score — its average reward plus an exploration bonus that grows for rarely-pulled arms — and pulls the arm with the highest score. This "optimism in the face of uncertainty" yields strong regret guarantees.
Problem statement
Implement ucb1_select(counts, values, t) returning the index of the arm to pull next:
If any arm has counts[i] == 0, it must be selected first (infinite bonus). t is the current round (total pulls so far + 1).
Input
counts—np.ndarray(K,), number of times each arm has been pulled.values—np.ndarray(K,), current mean reward estimate per arm.t—int, the current round number (>= 1).
Output
An int: the index of the selected arm.
Examples
Example 1
Input: counts = [1, 1, 1], values = [0.1, 0.5, 0.2], t = 4
Output: 1
Explanation: every arm has the same exploration bonus , so the choice is driven by the mean — arm 1 has the highest value (0.5).
Constraints
- Any arm with
counts[i] == 0is selected immediately (treat its UCB as ). - The bonus is added to the mean value.
- Break ties by lowest index (NumPy
argmax).
Notes
- The bonus shrinks as an arm is pulled more () and grows slowly with time (), so under-explored arms eventually get revisited.
- UCB1 is deterministic given the statistics — unlike -greedy or Thompson sampling, it needs no randomness to explore.
This problem ships 5 hidden tests. They run in your browser via Pyodide — no backend, no submission queue. Press ▶ Run tests to execute.
- •Reference example picks the highest-value arm under equal bonuses
- •Unpulled arm is selected first
- •Rarely-pulled arm gets a larger exploration bonus
- •Pure exploitation when counts and bonuses are equal
- •Matches the explicit UCB formula