Bellman equation for value iteration
Background
Value iteration solves a known MDP by repeatedly applying the Bellman optimality update. The value of a state is the best expected return achievable from it: over all actions, take the one with the highest expected immediate reward plus discounted next-state value. Sweeping this update over all states until it converges yields the optimal value function .
Problem statement
Implement bellman_update(V, transitions, gamma) performing one synchronous Bellman optimality sweep:
Input
V—np.ndarray(n_states,), current value estimates.transitions— list indexed by state;transitions[s]is a dictaction -> list of (prob, next_state, reward, done)tuples.gamma—float, discount factor.
Output
An np.ndarray (n_states,): the updated values new_V.
Examples
Example 1
transitions = [
{0: [(1.0, 0, 0.0, False)], 1: [(1.0, 1, 1.0, False)]},
{0: [(1.0, 0, 0.0, False)], 1: [(1.0, 1, 1.0, True)]},
]
V = [0.0, 0.0], gamma = 0.9
Output: [1.0, 1.0]
Explanation: from state 0 the best action (1) gives reward 1 → value 1. From state 1, action 1 gives reward 1 and ends the episode (done=True), so no future value is added → value 1.
Constraints
- For each state, take the max over actions of the expected backup.
- A transition with
done=Truecontributes only its reward (no term). - The update is synchronous: compute
new_Vfrom the oldV, don't update in place mid-sweep.
Notes
- Iterating
bellman_updateto a fixed point gives the optimal values ; the greedy policy w.r.t. is optimal. - Replacing the
maxwith a fixed policy's action gives policy evaluation instead of value iteration.
This problem ships 5 hidden tests. They run in your browser via Pyodide — no backend, no submission queue. Press ▶ Run tests to execute.
- •Reference example
- •done=True drops the discounted future value
- •Takes the max over actions
- •Discounts the next-state value when not done
- •Averages over stochastic transitions