PPO clipped objective
Background
Proximal Policy Optimization (PPO) is the workhorse RL algorithm behind RLHF. Its key idea is the clipped surrogate objective: it improves the policy using the probability ratio between the new and old policies, but clips that ratio so a single update cannot move the policy too far. This keeps training stable without the expensive second-order math of trust-region methods.
Problem statement
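Implement ppo_clip_objective(log_probs, old_log_probs, advantages, epsilon=0.2), which returns the mean clipped surrogate objective (to be maximized):
$$L^{\text{CLIP}} = \frac{1}{N}\sum_{i=1}^{N} \min\big(r_i A_i,\ \operatorname{clip}(r_i,\ 1-\epsilon,\ 1+\epsilon)\, A_i\big)$$
where $r_i$ is the probability ratio exp(log_probs - old_log_probs) for sample $i$, $A_i$ is its advantage, and $\epsilon$ is epsilon.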
Input
- log_probs: np.ndarray of shape (N,), log-probs of the taken actions under the current policy.
- old_log_probs: np.ndarray of shape (N,), log-probs under the policy that collected the data.
- advantages: np.ndarray of shape (N,), advantage estimates.
- epsilon: float, clip range.
Output
A scalar float: the mean clipped surrogate objective.
Examples
Example 1
Input: log_probs = old_log_probs (ratio = 1), advantages = [2.0], epsilon = 0.2
Output: 2.0
Explanation: when the ratio is exactly 1, both the clipped and unclipped terms equal the advantage (1 × 2.0 = 2.0), so the objective is 2.0.
Constraints
- The ratio is exp(log_probs - old_log_probs).
- Take the elementwise min of the unclipped and clipped terms, then average.
- Clip the ratio to [1-epsilon, 1+epsilon] before multiplying by the advantage.
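The three constraints above translate almost line for line into NumPy. The following is a minimal sketch rather than an official reference solution. The np.asarray conversions and the final float(...) cast are assumptions added so that list inputs work and the return value is a plain Python float.

```python
import numpy as np


def ppo_clip_objective(log_probs, old_log_probs, advantages, epsilon=0.2):
    """Mean clipped surrogate objective (a quantity to maximize)."""
    # Assumption: accept lists as well as arrays by converting up front.
    log_probs = np.asarray(log_probs, dtype=float)
    old_log_probs = np.asarray(old_log_probs, dtype=float)
    advantages = np.asarray(advantages, dtype=float)

    # Probability ratio between the current and the data-collecting policy.
    ratio = np.exp(log_probs - old_log_probs)

    # Unclipped term, and the term with the ratio clipped to [1-eps, 1+eps].
    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1.0 - epsilon, 1.0 + epsilon) * advantages

    # Elementwise min of the two terms, then the batch mean.
    return float(np.mean(np.minimum(unclipped, clipped)))


# Example 1 from the problem statement: identical log-probs give ratio = 1.
example = ppo_clip_objective(np.array([-0.5]), np.array([-0.5]), np.array([2.0]))
print(example)
```

Working in log space and exponentiating the difference avoids dividing two potentially tiny probabilities.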
Notes
- For a positive advantage with a large ratio, min selects the clipped term, capping how much the objective rewards moving further; for a negative advantage with ratio > 1, the unclipped (more negative) term wins, discouraging the move.
- The clip is what makes PPO "proximal": it removes the incentive to push the ratio beyond the clip range [1-epsilon, 1+epsilon].
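A worked example makes the asymmetry in these notes concrete. The ratios 1.5 and 0.5 and the advantages of ±1 are illustrative values chosen here, with epsilon = 0.2, so the clipped ratio is 1.2 or 0.8 respectively:

```latex
\begin{aligned}
r = 1.5,\ A = +1 &:\quad \min(1.5,\ 1.2) = 1.2   &&\text{(clipped term selected)}\\
r = 1.5,\ A = -1 &:\quad \min(-1.5,\ -1.2) = -1.5 &&\text{(unclipped term selected)}\\
r = 0.5,\ A = +1 &:\quad \min(0.5,\ 0.8) = 0.5   &&\text{(unclipped term selected)}\\
r = 0.5,\ A = -1 &:\quad \min(-0.5,\ -0.8) = -0.8 &&\text{(clipped term selected)}
\end{aligned}
```

In both clipped rows the ratio has already moved in the direction the advantage favors, so the objective stops rewarding further movement. In both unclipped rows the ratio has moved the wrong way, and the full penalty is kept.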
This problem ships 5 hidden tests:
- Ratio = 1 gives the mean advantage
- Positive advantage with a large ratio is clipped
- Negative advantage with ratio > 1 uses the unclipped (more negative) term
- Objective is the batch mean
- Smaller epsilon clips more aggressively for a positive advantage
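The actual hidden tests are not shown, so the checks below are only a guess at what the five listed behaviors might look like in code. The inputs (ratios of 1.5, advantages of ±1 and 3) are made-up values, and the function is a compact restatement of the objective, included only so the block runs on its own.

```python
import numpy as np


def ppo_clip_objective(log_probs, old_log_probs, advantages, epsilon=0.2):
    # Compact restatement of the clipped objective, for a standalone check.
    ratio = np.exp(np.asarray(log_probs, float) - np.asarray(old_log_probs, float))
    adv = np.asarray(advantages, float)
    clipped = np.clip(ratio, 1.0 - epsilon, 1.0 + epsilon) * adv
    return float(np.mean(np.minimum(ratio * adv, clipped)))


zero = np.zeros(1)
big = np.log(np.array([1.5]))  # log-prob gap that yields ratio = 1.5

# 1. Ratio = 1 gives the mean advantage: mean([1, 3]) = 2.
same = np.array([-1.0, -2.0])
ratio_one = ppo_clip_objective(same, same, [1.0, 3.0])

# 2. Positive advantage, ratio 1.5: clipped at 1 + 0.2 = 1.2.
pos_clipped = ppo_clip_objective(big, zero, [1.0])

# 3. Negative advantage, ratio 1.5: unclipped term -1.5 is the min.
neg_unclipped = ppo_clip_objective(big, zero, [-1.0])

# 4. Batch mean: terms are 1.2 and 3.0, so the mean is 2.1.
batch_mean = ppo_clip_objective(np.log([1.5, 1.0]), np.zeros(2), [1.0, 3.0])

# 5. Smaller epsilon clips harder: cap drops from 1.2 to 1.1.
small_eps = ppo_clip_objective(big, zero, [1.0], epsilon=0.1)

assert np.isclose(ratio_one, 2.0)
assert np.isclose(pos_clipped, 1.2)
assert np.isclose(neg_unclipped, -1.5)
assert np.isclose(batch_mean, 2.1)
assert np.isclose(small_eps, 1.1) and small_eps < pos_clipped
```

Comparing with np.isclose rather than == sidesteps floating-point round-off from the exp/log round trip.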