DPO loss
Background
Direct Preference Optimization (DPO) aligns a language model to human preferences without training a separate reward model or running RL. Given pairs where a "chosen" response is preferred over a "rejected" one , DPO directly increases the policy's relative log-probability of over , anchored to a frozen reference model so the policy does not drift too far.
Problem statement
Implement dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta=0.1). With per-example sequence log-probabilities, compute:
Return the mean loss over the batch.
Input
policy_chosen,policy_rejected—np.ndarray(N,), policy log-probs of the chosen/rejected responses.ref_chosen,ref_rejected—np.ndarray(N,), reference-model log-probs.beta—float, the KL/strength coefficient.
Output
A scalar float: the mean DPO loss.
Examples
Example 1
Input: policy_chosen=[-1.0], policy_rejected=[-2.0], ref_chosen=[-1.5], ref_rejected=[-2.5], beta=0.1
Output: 0.6931 (= -log sigma(0))
Explanation: the policy log-ratio is and the reference log-ratio is , so . With , the loss is .
Constraints
- subtracts the reference log-ratio from the policy log-ratio (the implicit reward difference).
- Use a numerically stable log-sigmoid (e.g.
-softplus(-x)), notlog(sigmoid(x))directly. - Return the mean over the batch.
Notes
- Minimizing the loss pushes positive — the policy prefers over more strongly than the reference does.
- controls how far the policy may move from the reference; larger makes the loss more sensitive to the log-ratio gap.
This problem ships 5 hidden tests. They run in your browser via Pyodide — no backend, no submission queue. Press ▶ Run tests to execute.
- •Reference example -> log 2
- •Equal log-ratios give loss log(2)
- •A larger preference margin lowers the loss
- •Loss is the mean over the batch
- •Loss is always positive