Policy gradient with REINFORCEMedium

Policy gradient with REINFORCE

Background

REINFORCE is the foundational policy-gradient algorithm. It directly optimizes a parameterized policy by following the gradient of expected return: actions that preceded high returns are made more likely. For a softmax policy over discrete actions, the per-step gradient of logπ\log\pi has a clean closed form, and each step is weighted by the return (sum of future rewards) from that step onward.

Problem statement

Implement compute_policy_gradient(theta, episodes). The policy is softmax(theta[s]) over actions in state s. For each episode, compute returns Gt=ktrkG_t = \sum_{k\ge t} r_k, and accumulate:

logπ(as)=(1aπ(s)),J+=Gtlogπ(atst)\nabla\log\pi(a\mid s) = (\mathbf 1_a - \pi(\cdot\mid s)), \qquad \nabla J \mathrel{+}= G_t\,\nabla\log\pi(a_t\mid s_t)

Return the gradient averaged over episodes (same shape as theta).

Input

  • thetanp.ndarray (num_states, num_actions), policy logits.
  • episodes — list of episodes; each episode is a list of (state, action, reward) tuples.

Output

An np.ndarray shaped like theta: the mean policy gradient across episodes.

Examples

Example 1

Input:  theta = zeros((2, 2)), episodes = [[(0,1,0), (1,0,1)], [(0,0,0)]]
Output: [[-0.25, 0.25], [0.25, -0.25]]

Explanation: in episode 1 both steps have return 1. At state 0 (action 1), logπ=[0.5,0.5]\nabla\log\pi = [-0.5, 0.5]; at state 1 (action 0), [0.5,0.5][0.5, -0.5]. Episode 2 has return 0, so contributes nothing. Averaging over 2 episodes halves episode 1's gradient.

Constraints

  • Returns are the reverse-cumulative sum of rewards (undiscounted here): Gt=ktrkG_t = \sum_{k\ge t} r_k.
  • For a softmax policy, logπ(as)\nabla\log\pi(a\mid s) places π-\pi on row s and adds 1 at the taken action.
  • Average the accumulated gradient over the number of episodes.

Notes

  • The (1aπ)(\mathbf 1_a - \pi) form comes from differentiating logsoftmax\log\text{softmax} — it pushes probability toward the taken action in proportion to how unexpected it was.
  • Weighting by the return is the heart of REINFORCE; subtracting a baseline (e.g. the mean return) reduces variance without bias.
Python
Loading...

This problem ships 5 hidden tests. They run in your browser via Pyodide — no backend, no submission queue. Press ▶ Run tests to execute.

  • Reference example
  • Zero rewards give zero gradient
  • Gradient row for the taken action sums to zero (softmax property)
  • Returns are cumulative future rewards
  • Averages across episodes