GRPO objective
Background
Group Relative Policy Optimization (GRPO) is the RL objective behind models like DeepSeek-R1. It is a PPO variant that drops the value network: instead of a learned baseline, it estimates advantages by comparing the rewards of a group of sampled responses to the same prompt. It keeps PPO's clipped ratio for stability and adds a KL penalty pulling the policy toward a reference model.
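This problem takes the advantages A as an input, but it may help to see where they come from. The sketch below standardizes rewards within one group of responses to the same prompt. The function name group_relative_advantages, the use of the population standard deviation, and the eps guard are assumptions made for illustration; real training code may differ.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one group of responses to the same prompt.

    Assumption: advantage = (reward - group mean) / (group std + eps),
    with the population standard deviation. eps only guards against a
    zero std when every response gets the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Hypothetical rewards for four sampled responses to one prompt.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
print(advantages)  # positive for above-average rewards, negative otherwise
```

No value network appears anywhere: the group mean plays the role of the baseline.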
Problem statement
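Implement grpo_objective(rhos, A, pi_theta_old, pi_theta_ref, epsilon=0.2, beta=0.01):
$$J_{GRPO} = \frac{1}{G}\sum_{i=1}^{G} \min\big(\rho_i A_i,\ \mathrm{clip}(\rho_i,\, 1-\epsilon,\, 1+\epsilon)\, A_i\big) - \beta\, D_{KL}\big(\pi_\theta \,\|\, \pi_{\theta_{ref}}\big)$$
Recover $\pi_\theta = \rho \cdot \pi_{\theta_{old}}$ (elementwise), normalize both $\pi_\theta$ and $\pi_{\theta_{ref}}$ to sum to 1, and use $D_{KL}(\pi_\theta \,\|\, \pi_{\theta_{ref}}) = \sum_i \pi_{\theta,i} \log\big(\pi_{\theta,i} / \pi_{\theta_{ref},i}\big)$.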
Input
- rhos: likelihood ratios $\rho_i = \pi_{\theta,i} / \pi_{\theta_{old},i}$, length G.
- A: advantage estimates, length G.
- pi_theta_old, pi_theta_ref: old-policy and reference probabilities, length G.
- epsilon, beta: clip range and KL coefficient.
Raise ValueError if the lengths differ.
Output
A scalar float: the GRPO objective.
Examples
Example 1
grpo_objective([1.2, 0.8, 1.1], [1.0, 1.0, 1.0], [0.9, 1.1, 1.0], [1.0, 0.5, 1.5], 0.2, 0.01)
-> 1.032749
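Explanation: all ratios lie within $[1-\epsilon,\, 1+\epsilon] = [0.8, 1.2]$, so nothing clips; the average min term is $(1.2 + 0.8 + 1.1)/3 \approx 1.0333$. Subtracting the small KL penalty ($\beta \cdot D_{KL} \approx 0.0006$) gives $1.032749$.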
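The arithmetic behind Example 1 can be traced step by step. The worked calculation below assumes the KL is the standard $\sum_i p_i \log(p_i / q_i)$ over the normalized distributions, which reproduces the stated output; intermediate values are rounded.

```latex
\[
\begin{aligned}
\frac{1}{3}\sum_{i} \min\big(\rho_i A_i,\ \mathrm{clip}(\rho_i, 0.8, 1.2)\,A_i\big)
  &= \frac{1.2 + 0.8 + 1.1}{3} \approx 1.033333 \\
\rho \cdot \pi_{\theta_{old}} = (1.08,\ 0.88,\ 1.10)
  &\ \xrightarrow{\ \div\, 3.06\ }\ \pi_\theta \approx (0.35294,\ 0.28758,\ 0.35948) \\
(1.0,\ 0.5,\ 1.5)
  &\ \xrightarrow{\ \div\, 3.0\ }\ \pi_{\theta_{ref}} \approx (0.33333,\ 0.16667,\ 0.50000) \\
D_{KL}\big(\pi_\theta \,\|\, \pi_{\theta_{ref}}\big)
  &\approx 0.02017 + 0.15688 - 0.11861 = 0.05844 \\
J_{GRPO}
  &\approx 1.033333 - 0.01 \times 0.05844 \approx 1.032749
\end{aligned}
\]
```

The third KL term is negative because the policy puts less mass on that response than the reference does; the sum over all terms is still non-negative.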
Constraints
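- Clip $\rho_i$ to $[1-\epsilon,\, 1+\epsilon]$ and take the elementwise min of the clipped term with the unclipped term $\rho_i A_i$, then average.
- Build $\pi_\theta = \rho \cdot \pi_{\theta_{old}}$, then normalize $\pi_\theta$ and $\pi_{\theta_{ref}}$ to probability distributions before the KL.
- Add a small positive constant inside the log for numerical stability.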
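The constraints above can be put together in a minimal pure-Python sketch. It is one possible solution under stated assumptions, not the reference implementation: the stabilizing constant (named tiny here, set to 1e-10) and its exact placement inside the log are guesses, since the problem text does not pin them down.

```python
import math

def grpo_objective(rhos, A, pi_theta_old, pi_theta_ref, epsilon=0.2, beta=0.01):
    G = len(rhos)
    if not (len(A) == len(pi_theta_old) == len(pi_theta_ref) == G):
        raise ValueError("rhos, A, pi_theta_old and pi_theta_ref must have the same length")

    # Clipped surrogate: mean over the group of
    # min(rho * A, clip(rho, 1 - epsilon, 1 + epsilon) * A).
    lo, hi = 1.0 - epsilon, 1.0 + epsilon
    surrogate = 0.0
    for rho, adv in zip(rhos, A):
        clipped = min(max(rho, lo), hi)
        surrogate += min(rho * adv, clipped * adv)
    surrogate /= G  # assumes G > 0

    # Recover the current policy as rho * pi_theta_old, then normalize
    # both the policy and the reference so each sums to 1.
    pi_theta = [rho * p_old for rho, p_old in zip(rhos, pi_theta_old)]
    z_theta = sum(pi_theta)
    z_ref = sum(pi_theta_ref)
    p = [x / z_theta for x in pi_theta]
    q = [x / z_ref for x in pi_theta_ref]

    # KL(p || q) = sum_i p_i * log(p_i / q_i).
    tiny = 1e-10  # assumed value of the stabilizing constant
    kl = sum(p_i * math.log(p_i / q_i + tiny) for p_i, q_i in zip(p, q))

    return surrogate - beta * kl

result = grpo_objective([1.2, 0.8, 1.1], [1.0, 1.0, 1.0],
                        [0.9, 1.1, 1.0], [1.0, 0.5, 1.5], 0.2, 0.01)
print(round(result, 6))  # 1.032749
```

Plain lists keep the sketch dependency-free; a vectorized version with an array library would follow the same three steps.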
Notes
- Dropping the value network (PPO's critic) is GRPO's main simplification — group-relative advantages replace it, which is cheaper for LLM-scale RL.
- The KL term keeps the fine-tuned policy close to the reference, which discourages drift toward reward hacking.
This problem ships 5 hidden tests, which run in the browser via Pyodide:
- Reference example
- Matching policies make KL ~ 0; objective ~ mean advantage
- Large ratio with positive advantage is clipped
- Length mismatch raises ValueError
- Larger beta lowers the objective when KL > 0
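The clipping case in the test list can be checked in isolation. The helper below, with the hypothetical name clipped_term, evaluates a single term of the surrogate and shows that the min makes clipping one-sided: it caps the gain from a large ratio when the advantage is positive, but leaves the penalty untouched when the advantage is negative.

```python
def clipped_term(rho, adv, epsilon=0.2):
    """One term of the clipped surrogate: min(rho * A, clip(rho) * A)."""
    clipped = min(max(rho, 1.0 - epsilon), 1.0 + epsilon)
    return min(rho * adv, clipped * adv)

print(clipped_term(2.0, 1.0))   # 1.2  -> capped at (1 + epsilon) * A
print(clipped_term(2.0, -1.0))  # -2.0 -> unclipped term is smaller, so it wins
print(clipped_term(0.5, 1.0))   # 0.5  -> unclipped term is smaller, so it wins
print(clipped_term(0.5, -1.0))  # -0.8 -> capped at (1 - epsilon) * A
```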