UCB1 multi-armed banditEasy

UCB1 multi-armed bandit

Background

The multi-armed bandit problem asks: given several actions ("arms") with unknown reward distributions, how do you balance exploration (trying uncertain arms) against exploitation (pulling the current best)? UCB1 (Upper Confidence Bound) gives each arm an optimistic score — its average reward plus an exploration bonus that grows for rarely-pulled arms — and pulls the arm with the highest score. This "optimism in the face of uncertainty" yields strong regret guarantees.

Problem statement

Implement ucb1_select(counts, values, t) returning the index of the arm to pull next:

UCBi=xˉi+2lntni,select argmaxiUCBi\text{UCB}_i = \bar x_i + \sqrt{\frac{2\ln t}{n_i}}, \qquad \text{select } \arg\max_i \text{UCB}_i

If any arm has counts[i] == 0, it must be selected first (infinite bonus). t is the current round (total pulls so far + 1).

Input

  • countsnp.ndarray (K,), number of times each arm has been pulled.
  • valuesnp.ndarray (K,), current mean reward estimate per arm.
  • tint, the current round number (>= 1).

Output

An int: the index of the selected arm.

Examples

Example 1

Input:  counts = [1, 1, 1], values = [0.1, 0.5, 0.2], t = 4
Output: 1

Explanation: every arm has the same exploration bonus 2ln4/1\sqrt{2\ln 4 / 1}, so the choice is driven by the mean — arm 1 has the highest value (0.5).

Constraints

  • Any arm with counts[i] == 0 is selected immediately (treat its UCB as ++\infty).
  • The bonus is 2lnt/ni\sqrt{2\ln t / n_i} added to the mean value.
  • Break ties by lowest index (NumPy argmax).

Notes

  • The bonus shrinks as an arm is pulled more (nin_i\uparrow) and grows slowly with time (lnt\ln t), so under-explored arms eventually get revisited.
  • UCB1 is deterministic given the statistics — unlike ϵ\epsilon-greedy or Thompson sampling, it needs no randomness to explore.
Python
Loading...

This problem ships 5 hidden tests. They run in your browser via Pyodide — no backend, no submission queue. Press ▶ Run tests to execute.

  • Reference example picks the highest-value arm under equal bonuses
  • Unpulled arm is selected first
  • Rarely-pulled arm gets a larger exploration bonus
  • Pure exploitation when counts and bonuses are equal
  • Matches the explicit UCB formula