TD(0) value update
Background
TD(0) is the simplest temporal-difference learning rule for policy evaluation: it estimates the value function of a fixed policy from experience, without a model and without waiting for an episode to end. After each transition it nudges the current state's estimate toward a bootstrapped target: the observed reward plus the discounted estimate of the next state's value.
Problem statement
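Implement td0_update(V, s, r, s_next, done, alpha, gamma), which performs one TD(0) update and returns the new value table:
$$V(s) \leftarrow V(s) + \alpha \left[ r + \gamma \, (1 - \text{done}) \, V(s') - V(s) \right]$$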
Only V[s] changes; return the (copied) updated array.
Input
- V: np.ndarray of shape (n_states,), current value estimates.
- s, s_next: int, current and next state indices.
- r: float, observed reward.
- done: bool, whether s_next is terminal.
- alpha, gamma: float, learning rate and discount factor.
Output
An np.ndarray of shape (n_states,): a copy of V with V[s] updated.
Examples
Example 1
Input: V = [0, 0, 0], s = 0, r = 1.0, s_next = 1, done = False, alpha = 0.5, gamma = 0.9
Output: [0.5, 0, 0]
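Explanation: the TD error is $\delta = r + \gamma V(s') - V(s) = 1.0 + 0.9 \cdot 0 - 0 = 1.0$. The update is $V(s) \leftarrow V(s) + \alpha \delta = 0 + 0.5 \cdot 1.0 = 0.5$.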
Constraints
- The target drops the $\gamma V(s')$ term when done is True.
- Only V[s] is modified; the rest of the array is unchanged.
- Return a copy (do not mutate the caller's array in place).
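One way to satisfy the statement and constraints above is sketched below. It is a minimal illustration rather than the official reference solution; casting the copy to float is an assumption made here so that an integer-valued V does not truncate the update.

```python
import numpy as np

def td0_update(V, s, r, s_next, done, alpha, gamma):
    """One TD(0) update; returns a new array and leaves V untouched."""
    # np.array copies by default, so the caller's array is not mutated.
    # dtype=float is an assumption of this sketch (avoids integer truncation).
    V_new = np.array(V, dtype=float)
    # Bootstrapped target: the next-state term is dropped on terminal transitions.
    target = r if done else r + gamma * V_new[s_next]
    # TD error, followed by a step of size alpha toward the target.
    td_error = target - V_new[s]
    V_new[s] += alpha * td_error
    return V_new

# Example 1 from the problem statement.
V = np.zeros(3)
V_after = td0_update(V, s=0, r=1.0, s_next=1, done=False, alpha=0.5, gamma=0.9)
# V_after[0] is 0.5; the other entries, and V itself, are unchanged.
```

Building the target with a conditional instead of multiplying by (1 - done) is only a stylistic choice; both forms give the same result for a boolean done.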
Notes
- TD(0) bootstraps: it learns a guess from a guess (the estimate $V(s)$ from the estimate $V(s')$), so it can update online, mid-episode, unlike Monte Carlo, which waits for the full return.
- The TD error $\delta$ is exactly the signal used by actor-critic methods and is the building block of TD($\lambda$).
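To illustrate the online, mid-episode nature of the update, the sketch below applies the TD error after every transition of a small made-up episode. The 4-state chain, the transition list, and the helper name td_error are hypothetical and exist only for this illustration; the table is updated in place for brevity, unlike td0_update, which returns a copy.

```python
import numpy as np

def td_error(V, s, r, s_next, done, gamma):
    """TD error: bootstrapped target minus the current estimate of V[s]."""
    target = r if done else r + gamma * V[s_next]
    return target - V[s]

# Hypothetical episode on a 4-state chain: (s, r, s_next, done).
# Only the last transition is rewarded, and it ends in terminal state 3.
episode = [(0, 0.0, 1, False), (1, 0.0, 2, False), (2, 1.0, 3, True)]

V = np.zeros(4)
alpha, gamma = 0.5, 0.9
deltas = []

# Replay the same episode twice; each transition updates V immediately.
for sweep in range(2):
    for s, r, s_next, done in episode:
        delta = td_error(V, s, r, s_next, done, gamma)
        V[s] += alpha * delta  # in-place update, for brevity only
        deltas.append(delta)

# After sweep 1 only V[2] has moved (to 0.5). In sweep 2 state 1
# bootstraps from that estimate, so V ends as [0, 0.225, 0.75, 0].
```

The reward reaches state 1 only through the estimate of state 2, which is the "guess from a guess" behaviour: no full return is ever computed.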
This problem ships 5 hidden tests, which run in the browser via Pyodide:
- Reference example
- done=True drops the bootstrap term
- Bootstraps from the next-state value
- Only V[s] changes; input is not mutated
- alpha=0 leaves the value unchanged