PPO clipped objective

Difficulty: Medium

Background

Proximal Policy Optimization (PPO) is the workhorse RL algorithm behind RLHF. Its key idea is the clipped surrogate objective: it improves the policy using the probability ratio between the new and old policies, but clips that ratio so a single update cannot move the policy too far. This keeps training stable without the expensive second-order optimization used by trust-region methods such as TRPO.

Problem statement

Implement ppo_clip_objective(log_probs, old_log_probs, advantages, epsilon=0.2) returning the (mean) clipped surrogate objective to maximize:

$$r_t = e^{\,\log\pi_\theta(a_t) - \log\pi_{\text{old}}(a_t)}, \qquad L = \frac{1}{N}\sum_t \min\!\big(r_t A_t,\ \operatorname{clip}(r_t, 1-\epsilon, 1+\epsilon)\,A_t\big)$$
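
Splitting on the sign of the advantage gives an equivalent per-sample form that makes the role of the clip easier to see. This is only a restatement of the objective above, not an extra requirement of the problem:

$$
\min\!\big(r_t A_t,\ \operatorname{clip}(r_t, 1-\epsilon, 1+\epsilon)\,A_t\big)
=
\begin{cases}
\min(r_t,\ 1+\epsilon)\,A_t, & A_t \ge 0,\\
\max(r_t,\ 1-\epsilon)\,A_t, & A_t < 0.
\end{cases}
$$

For $A_t \ge 0$ the term stops growing once $r_t$ exceeds $1+\epsilon$; for $A_t < 0$ it stops growing once $r_t$ falls below $1-\epsilon$.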

Input

  • log_probs: np.ndarray (N,), log-probs of taken actions under the current policy.
  • old_log_probs: np.ndarray (N,), log-probs under the policy that collected the data.
  • advantages: np.ndarray (N,), advantage estimates $A_t$.
  • epsilon: float, clip range.

Output

A scalar float: the mean clipped surrogate objective.

Examples

Example 1

Input:  log_probs = old_log_probs (ratio = 1), advantages = [2.0], epsilon = 0.2
Output: 2.0

Explanation: when the ratio is exactly 1, both the clipped and unclipped terms equal $1 \cdot A = 2$, so the objective is 2.0.

Constraints

  • The ratio is exp(log_probs - old_log_probs).
  • Take the elementwise min of the unclipped and clipped terms, then average.
  • Clip the ratio to [1-epsilon, 1+epsilon] before multiplying by the advantage.
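
The three constraints above can be sketched as follows. This is one possible NumPy implementation under the stated signature, not the reference solution; the `np.asarray` conversions are an added convenience so that plain lists are accepted as well.

```python
import numpy as np


def ppo_clip_objective(log_probs, old_log_probs, advantages, epsilon=0.2):
    """Mean clipped surrogate objective (to be maximized)."""
    log_probs = np.asarray(log_probs, dtype=float)
    old_log_probs = np.asarray(old_log_probs, dtype=float)
    advantages = np.asarray(advantages, dtype=float)

    # Probability ratio r_t, computed from the difference of log-probs.
    ratio = np.exp(log_probs - old_log_probs)

    # Clip the ratio first, then multiply by the advantage.
    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1.0 - epsilon, 1.0 + epsilon) * advantages

    # Elementwise min of the two terms, then the batch mean.
    return float(np.mean(np.minimum(unclipped, clipped)))


# Example 1 from the problem: ratio = 1, advantage = 2.0.
print(ppo_clip_objective(np.array([-0.5]), np.array([-0.5]), np.array([2.0])))  # prints 2.0
```

Working with the difference of log-probs avoids dividing two small probabilities directly.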

Notes

  • For a positive advantage with a ratio above 1+epsilon, min selects the clipped term, capping how much the objective rewards moving further; for a negative advantage with a ratio above 1+epsilon, the unclipped (more negative) term wins, discouraging the move.
  • The clip is what makes PPO "proximal": it removes the incentive to push the ratio beyond $1 \pm \epsilon$.
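
To make the notes above concrete, the sketch below evaluates both terms for four hand-picked (ratio, advantage) pairs with epsilon = 0.2. The specific values are illustrative assumptions, chosen so that each clipping regime appears once.

```python
import numpy as np

epsilon = 0.2
# Hypothetical (ratio, advantage) pairs, one per clipping regime.
ratios = np.array([1.5, 0.5, 1.5, 0.5])
advantages = np.array([2.0, 2.0, -2.0, -2.0])

unclipped = ratios * advantages
clipped = np.clip(ratios, 1.0 - epsilon, 1.0 + epsilon) * advantages
per_sample = np.minimum(unclipped, clipped)

for r, a, u, c, m in zip(ratios, advantages, unclipped, clipped, per_sample):
    picked = "clipped" if c < u else "unclipped"
    print(f"r={r:.1f} A={a:+.1f} unclipped={u:+.1f} clipped={c:+.1f} min={m:+.1f} ({picked})")
```

The clipped term is selected only when the ratio has already moved past the bound in the direction the advantage favors (r > 1+epsilon with A > 0, or r < 1-epsilon with A < 0). In those cases the per-sample objective no longer depends on the ratio, so its gradient is zero.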

This problem ships 5 hidden tests. They run in your browser via Pyodide, with no backend and no submission queue.

  • Ratio = 1 gives mean advantage
  • Positive advantage with large ratio is clipped
  • Negative advantage with ratio>1 uses the unclipped (more negative) term
  • Objective is the batch mean
  • Smaller epsilon clips more aggressively for positive advantage
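
The hidden tests themselves are not shown. As a rough guide, the checks below are hypothetical assertions written to match the five test names above; the inputs and expected values are assumptions worked out by hand, and the compact `ppo_clip_objective` is included only so the snippet runs on its own.

```python
import numpy as np


def ppo_clip_objective(log_probs, old_log_probs, advantages, epsilon=0.2):
    # Compact version of the objective defined in the problem statement.
    ratio = np.exp(np.asarray(log_probs) - np.asarray(old_log_probs))
    adv = np.asarray(advantages, dtype=float)
    clipped = np.clip(ratio, 1.0 - epsilon, 1.0 + epsilon)
    return float(np.mean(np.minimum(ratio * adv, clipped * adv)))


zeros = np.zeros(3)

# Ratio = 1 gives the mean advantage.
assert np.isclose(ppo_clip_objective(zeros, zeros, [1.0, 2.0, 3.0]), 2.0)

# Positive advantage with a large ratio is clipped at (1 + epsilon) * A.
assert np.isclose(ppo_clip_objective(np.log([2.0]), [0.0], [1.0]), 1.2)

# Negative advantage with ratio > 1 + epsilon keeps the unclipped term.
assert np.isclose(ppo_clip_objective(np.log([2.0]), [0.0], [-1.0]), -2.0)

# Objective is the batch mean of the per-sample terms: (1.2 + -2.0) / 2.
assert np.isclose(ppo_clip_objective(np.log([2.0, 2.0]), [0.0, 0.0], [1.0, -1.0]), -0.4)

# Smaller epsilon clips more aggressively for a positive advantage.
tight = ppo_clip_objective(np.log([2.0]), [0.0], [1.0], epsilon=0.1)
loose = ppo_clip_objective(np.log([2.0]), [0.0], [1.0], epsilon=0.3)
assert tight < loose
```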