GRPO objectiveHard

GRPO objective

Background

Group Relative Policy Optimization (GRPO) is the RL objective behind models like DeepSeek-R1. It is a PPO variant that drops the value network: instead of a learned baseline, it estimates advantages by comparing the rewards of a group of sampled responses to the same prompt. It keeps PPO's clipped ratio for stability and adds a KL penalty pulling the policy toward a reference model.

Problem statement

Implement grpo_objective(rhos, A, pi_theta_old, pi_theta_ref, epsilon=0.2, beta=0.01):

L=1Gimin ⁣(ρiAi, clip(ρi,1ϵ,1+ϵ)Ai)clipped surrogate    βDKL(πθπref)L = \underbrace{\frac{1}{G}\sum_i \min\!\big(\rho_i A_i,\ \operatorname{clip}(\rho_i, 1-\epsilon, 1+\epsilon)A_i\big)}_{\text{clipped surrogate}} \;-\; \beta\, D_{KL}(\pi_\theta \,\|\, \pi_{\text{ref}})

Recover πθρπold\pi_\theta \propto \rho \cdot \pi_{\text{old}}, normalize both πθ\pi_\theta and πref\pi_{\text{ref}} to sum to 1, and use DKL=πθlog(πθ/πref+1010)D_{KL} = \sum \pi_\theta \log(\pi_\theta / \pi_{\text{ref}} + 10^{-10}).

Input

  • rhos — likelihood ratios ρi=πθ(oi)/πold(oi)\rho_i = \pi_\theta(o_i)/\pi_{\text{old}}(o_i), length G.
  • A — advantage estimates, length G.
  • pi_theta_old, pi_theta_ref — old-policy and reference probabilities, length G.
  • epsilon, beta — clip range and KL coefficient.

Raise ValueError if the lengths differ.

Output

A scalar float: the GRPO objective.

Examples

Example 1

grpo_objective([1.2, 0.8, 1.1], [1.0, 1.0, 1.0], [0.9, 1.1, 1.0], [1.0, 0.5, 1.5], 0.2, 0.01)
-> 1.032749

Explanation: ratios are within 1±0.21\pm0.2 so nothing clips; the average min term is 3.1/3=1.03333.1/3=1.0333. Subtracting the small KL penalty (β0.0584\beta\cdot 0.0584) gives 1.0327\approx 1.0327.

Constraints

  • Clip ρ\rho to [1ϵ,1+ϵ][1-\epsilon, 1+\epsilon] and take the elementwise min with the unclipped term, then average.
  • Build πθ=ρπold\pi_\theta = \rho \cdot \pi_{\text{old}}, then normalize πθ\pi_\theta and πref\pi_{\text{ref}} to probability distributions before the KL.
  • Add 101010^{-10} inside the log for numerical stability.

Notes

  • Dropping the value network (PPO's critic) is GRPO's main simplification — group-relative advantages replace it, which is cheaper for LLM-scale RL.
  • The KL term keeps the fine-tuned policy close to the reference, preventing reward-hacking drift.
Python
Loading...

This problem ships 5 hidden tests. They run in your browser via Pyodide — no backend, no submission queue. Press ▶ Run tests to execute.

  • Reference example
  • Matching policies make KL ~ 0; objective ~ mean advantage
  • Large ratio with positive advantage is clipped
  • Length mismatch raises ValueError
  • Larger beta lowers the objective when KL > 0