Policy gradient with REINFORCE
Background
REINFORCE is the foundational policy-gradient algorithm. It directly optimizes a parameterized policy by following the gradient of expected return: actions that preceded high returns are made more likely. For a softmax policy over discrete actions, the per-step gradient of has a clean closed form, and each step is weighted by the return (sum of future rewards) from that step onward.
Problem statement
Implement compute_policy_gradient(theta, episodes). The policy is softmax(theta[s]) over actions in state s. For each episode, compute returns , and accumulate:
Return the gradient averaged over episodes (same shape as theta).
Input
theta—np.ndarray(num_states, num_actions), policy logits.episodes— list of episodes; each episode is a list of(state, action, reward)tuples.
Output
An np.ndarray shaped like theta: the mean policy gradient across episodes.
Examples
Example 1
Input: theta = zeros((2, 2)), episodes = [[(0,1,0), (1,0,1)], [(0,0,0)]]
Output: [[-0.25, 0.25], [0.25, -0.25]]
Explanation: in episode 1 both steps have return 1. At state 0 (action 1), ; at state 1 (action 0), . Episode 2 has return 0, so contributes nothing. Averaging over 2 episodes halves episode 1's gradient.
Constraints
- Returns are the reverse-cumulative sum of rewards (undiscounted here): .
- For a softmax policy, places on row
sand adds 1 at the taken action. - Average the accumulated gradient over the number of episodes.
Notes
- The form comes from differentiating — it pushes probability toward the taken action in proportion to how unexpected it was.
- Weighting by the return is the heart of REINFORCE; subtracting a baseline (e.g. the mean return) reduces variance without bias.
This problem ships 5 hidden tests. They run in your browser via Pyodide — no backend, no submission queue. Press ▶ Run tests to execute.
- •Reference example
- •Zero rewards give zero gradient
- •Gradient row for the taken action sums to zero (softmax property)
- •Returns are cumulative future rewards
- •Averages across episodes