Bellman equation for value iterationMedium

Bellman equation for value iteration

Background

Value iteration solves a known MDP by repeatedly applying the Bellman optimality update. The value of a state is the best expected return achievable from it: over all actions, take the one with the highest expected immediate reward plus discounted next-state value. Sweeping this update over all states until it converges yields the optimal value function VV^*.

Problem statement

Implement bellman_update(V, transitions, gamma) performing one synchronous Bellman optimality sweep:

V(s)=maxa(p,s,r,done)p(r+γ(1done)V(s))V'(s) = \max_a \sum_{(p,\,s',\,r,\,\text{done})} p\,\big(r + \gamma\,(1-\text{done})\,V(s')\big)

Input

  • Vnp.ndarray (n_states,), current value estimates.
  • transitions — list indexed by state; transitions[s] is a dict action -> list of (prob, next_state, reward, done) tuples.
  • gammafloat, discount factor.

Output

An np.ndarray (n_states,): the updated values new_V.

Examples

Example 1

transitions = [
  {0: [(1.0, 0, 0.0, False)], 1: [(1.0, 1, 1.0, False)]},
  {0: [(1.0, 0, 0.0, False)], 1: [(1.0, 1, 1.0, True)]},
]
V = [0.0, 0.0], gamma = 0.9
Output: [1.0, 1.0]

Explanation: from state 0 the best action (1) gives reward 1 → value 1. From state 1, action 1 gives reward 1 and ends the episode (done=True), so no future value is added → value 1.

Constraints

  • For each state, take the max over actions of the expected backup.
  • A transition with done=True contributes only its reward (no γV(s)\gamma V(s') term).
  • The update is synchronous: compute new_V from the old V, don't update in place mid-sweep.

Notes

  • Iterating bellman_update to a fixed point gives the optimal values VV^*; the greedy policy w.r.t. VV^* is optimal.
  • Replacing the max with a fixed policy's action gives policy evaluation instead of value iteration.
Python
Loading...

This problem ships 5 hidden tests. They run in your browser via Pyodide — no backend, no submission queue. Press ▶ Run tests to execute.

  • Reference example
  • done=True drops the discounted future value
  • Takes the max over actions
  • Discounts the next-state value when not done
  • Averages over stochastic transitions