TD(0) value update

Difficulty: Easy

Background

TD(0) is the simplest temporal-difference learning rule for policy evaluation: it estimates the value function $V(s)$ of a fixed policy from experience, without a model and without waiting for an episode to end. After each transition it nudges $V(s)$ toward a bootstrapped target: the observed reward plus the discounted estimate of the next state's value.

Problem statement

Implement td0_update(V, s, r, s_next, done, alpha, gamma) performing one TD(0) update and returning the new value table:

$$\delta = r + \gamma\,(1-\text{done})\,V(s') - V(s), \qquad V(s) \leftarrow V(s) + \alpha\,\delta$$

Only V[s] changes; return the (copied) updated array.

Input

  • V: np.ndarray (n_states,), current value estimates.
  • s, s_next: int, current and next state indices.
  • r: float, observed reward.
  • done: bool, whether s_next is terminal.
  • alpha, gamma: float, learning rate and discount.

Output

An np.ndarray (n_states,): a copy of V with V[s] updated.

Examples

Example 1

Input:  V = [0, 0, 0], s = 0, r = 1.0, s_next = 1, done = False, alpha = 0.5, gamma = 0.9
Output: [0.5, 0, 0]

Explanation: the TD error is $\delta = 1 + 0.9\cdot V[1] - V[0] = 1$, so the update is $V[0] \leftarrow V[0] + 0.5\cdot 1 = 0.5$.
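
The terminal branch can be checked by hand in the same way. The following worked case is hypothetical (it is not one of the page's examples) and assumes V = [0.2, 0.4, 5.0], s = 1, r = 1.0, s_next = 2, done = True, alpha = 0.5, gamma = 0.9. Because done is True, the factor $(1-\text{done})$ is zero and the large value V[2] = 5.0 plays no role:

$$\delta = 1.0 + 0.9\cdot(1-1)\cdot 5.0 - 0.4 = 0.6, \qquad V[1] \leftarrow 0.4 + 0.5\cdot 0.6 = 0.7$$

Under these assumed inputs the returned array would be [0.2, 0.7, 5.0].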

Constraints

  • The target drops the $\gamma V(s')$ term when done is True.
  • Only V[s] is modified; the rest of the array is unchanged.
  • Return a copy (do not mutate the caller's array in place).
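
The three constraints above can be met in a few lines. What follows is a minimal sketch of one possible implementation, not the page's reference solution. It assumes V may arrive as an integer array or a plain list, so it converts to float while copying, and it skips the lookup of V[s_next] entirely when done is True.

```python
import numpy as np


def td0_update(V, s, r, s_next, done, alpha, gamma):
    """Return a copy of V after one TD(0) update of V[s]."""
    # np.array copies by default, so the caller's array is untouched.
    # dtype=float is an assumption: it lets inputs like [0, 0, 0]
    # hold fractional values after the update.
    V_new = np.array(V, dtype=float)
    # Bootstrap term: gamma * V(s'), dropped on terminal transitions.
    bootstrap = 0.0 if done else gamma * V_new[s_next]
    delta = r + bootstrap - V_new[s]  # TD error
    V_new[s] += alpha * delta         # only V[s] changes
    return V_new


V = np.zeros(3)
V_after = td0_update(V, s=0, r=1.0, s_next=1, done=False, alpha=0.5, gamma=0.9)
print(V_after.tolist())  # [0.5, 0.0, 0.0]
print(V.tolist())        # [0.0, 0.0, 0.0]  (input not mutated)
```

Using a conditional instead of multiplying by (1 - done) gives the same result for finite values and also avoids indexing V[s_next] when the episode has ended.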

Notes

  • TD(0) bootstraps: it learns a guess from a guess ($V(s')$), so it can update online, mid-episode, unlike Monte Carlo, which waits for the full return.
  • The TD error $\delta$ is exactly the signal used by actor-critic methods and is the building block of TD($\lambda$).
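
To make the online, mid-episode behaviour concrete, here is a small sketch that applies the TD(0) rule after every transition of a hypothetical deterministic chain (0 → 1 → 2 → terminal, reward 1.0 on the last step). The environment, the step size, and the episode count are assumptions chosen for illustration. For brevity the loop updates V in place rather than returning a copy as the problem requires.

```python
import numpy as np

# Hypothetical 3-state chain: 0 -> 1 -> 2 -> terminal.
# The only reward (1.0) is received on the transition out of state 2.
n_states, alpha, gamma = 3, 0.5, 0.9
V = np.zeros(n_states)

for episode in range(200):
    s = 0
    done = False
    while not done:
        s_next = s + 1
        done = s_next == n_states  # leaving state 2 ends the episode
        r = 1.0 if done else 0.0
        # Bootstrapped target; no next-state term on the terminal step.
        target = r if done else r + gamma * V[s_next]
        delta = target - V[s]      # TD error
        V[s] += alpha * delta      # applied immediately, mid-episode
        s = s_next

print(np.round(V, 3).tolist())
```

In this chain the true values are 0.81, 0.9 and 1.0 for states 0, 1 and 2, and the estimates approach them as episodes accumulate, even though no update ever waits for the full return.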

This problem ships 5 hidden tests, which run in the browser via Pyodide, with no backend and no submission queue:

  • Reference example
  • done=True drops the bootstrap term
  • Bootstraps from the next-state value
  • Only V[s] changes; input is not mutated
  • alpha=0 leaves the value unchanged