Prioritized experience replay
Background
Prioritized Experience Replay (PER) improves DQN-style training by replaying informative transitions more often. Instead of sampling the replay buffer uniformly, PER samples transitions according to the magnitude of their TD error (how "surprising" each one was). To control how strongly sampling is skewed, each priority $p_i$ is raised to a power $\alpha$ before normalizing, $P(i) = p_i^{\alpha} / \sum_k p_k^{\alpha}$, and the bias that this non-uniform sampling introduces is corrected by importance-sampling (IS) weights.
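To make the sampling step concrete, here is a minimal sketch of drawing replay indices in proportion to exponentiated priorities. The TD-error values, `eps`, the `alpha = 0.6` setting, and the batch size are all illustrative assumptions, not values taken from this problem.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is repeatable

# Hypothetical |TD-error| values for a 4-transition buffer (assumed data).
td_errors = np.array([0.1, 0.1, 0.1, 2.0])
eps = 1e-3    # assumed small constant so no transition has zero priority
alpha = 0.6   # assumed prioritization strength

# Priorities raised to alpha, then normalized into a distribution.
priorities = np.abs(td_errors) + eps
scaled = priorities ** alpha
probs = scaled / scaled.sum()

# Draw many indices (with replacement) according to probs and count them.
draws = rng.choice(len(probs), size=10_000, p=probs)
counts = np.bincount(draws, minlength=len(probs))

print("sampling probabilities:", probs.round(3))
print("empirical frequencies: ", (counts / counts.sum()).round(3))
```

The transition with the largest TD error (index 3) should be replayed far more often than the others, which is the behavior the IS weights later compensate for.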
Problem statement
Implement per_probabilities(priorities, alpha) and is_weights(probs, N, beta).
per_probabilities returns the sampling distribution; is_weights returns the normalized IS weights (divided by their max so the largest weight is 1).
Input
- priorities: np.ndarray (N,), non-negative priorities (e.g. |TD-error| + eps).
- alpha: float in [0, 1], prioritization strength (0 = uniform).
- probs: np.ndarray (N,), the sampling probabilities from per_probabilities.
- N: int, buffer size.
- beta: float in [0, 1], IS correction strength.
Output
- per_probabilities → np.ndarray (N,) summing to 1.
- is_weights → np.ndarray (N,) with max value 1.
Examples
Example 1
priorities = [1, 1, 2], alpha = 1.0
P = [0.25, 0.25, 0.5] # proportional to priorities
is_weights(P, N=3, beta=1.0) -> [1.0, 1.0, 0.5]
Explanation: with $\alpha = 1$, probabilities are proportional to priorities. IS weights are normalized by their max; the highest-probability sample (index 2) gets the smallest weight.
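The arithmetic behind Example 1 can be written out step by step. This assumes the usual PER weight formula $w_i = (N \cdot P(i))^{-\beta}$ followed by division by the largest weight, with $N = 3$ and $\alpha = \beta = 1$:

```latex
\begin{align*}
P &= \frac{[1,\ 1,\ 2]}{1 + 1 + 2} = [0.25,\ 0.25,\ 0.5] \\
N \cdot P &= [0.75,\ 0.75,\ 1.5] \\
(N \cdot P)^{-1} &= \left[\tfrac{4}{3},\ \tfrac{4}{3},\ \tfrac{2}{3}\right] \\
w &= \frac{(N \cdot P)^{-1}}{4/3} = [1,\ 1,\ 0.5]
\end{align*}
```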
Constraints
- $\alpha = 0$ makes all probabilities equal (uniform sampling); higher $\alpha$ sharpens the distribution toward high-priority items.
- IS weights use $w_i = (N \cdot P(i))^{-\beta}$, then divide by the maximum so weights are in $(0, 1]$.
- Probabilities must sum to 1.
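The constraints above can be sketched directly in NumPy. This is one possible approach rather than a reference solution; it assumes priorities are strictly positive (for example after adding eps), since a zero probability would make the corresponding IS weight infinite.

```python
import numpy as np

def per_probabilities(priorities, alpha):
    """Sampling distribution P(i) = p_i**alpha / sum_k p_k**alpha."""
    p = np.asarray(priorities, dtype=float)
    scaled = p ** alpha            # alpha = 0 turns every entry into 1
    return scaled / scaled.sum()   # normalize so the result sums to 1

def is_weights(probs, N, beta):
    """IS weights (N * P(i))**(-beta), divided by their maximum."""
    probs = np.asarray(probs, dtype=float)
    w = (N * probs) ** (-beta)     # assumes every probability is > 0
    return w / w.max()             # largest weight becomes exactly 1

# Reproduce Example 1 from the problem statement.
P = per_probabilities([1, 1, 2], alpha=1.0)
w = is_weights(P, N=3, beta=1.0)
print(P)  # values 0.25, 0.25, 0.5
print(w)  # values 1.0, 1.0, 0.5
```

With alpha = 0 the same function returns a uniform distribution, and the IS weights then all equal 1 regardless of beta, because there is no sampling bias left to correct.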
Notes
- $\alpha$ and $\beta$ trade off: $\alpha$ controls how much prioritization happens; $\beta$ controls how fully its bias is corrected (often annealed toward 1 over training).
- Normalizing IS weights by their max keeps gradient magnitudes bounded, only scaling updates down, which stabilizes training.
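The two notes above can be illustrated together with a rough sketch: a linear beta schedule and an IS-weighted squared-error loss. The helper names `linear_beta` and `weighted_td_loss`, the starting value 0.4, and the step counts are hypothetical choices for illustration only; they are not part of this problem's required interface.

```python
import numpy as np

def linear_beta(step, total_steps, beta_start=0.4, beta_end=1.0):
    """Hypothetical schedule: move beta linearly from beta_start to beta_end."""
    frac = min(max(step / total_steps, 0.0), 1.0)  # clamp progress to [0, 1]
    return beta_start + frac * (beta_end - beta_start)

def weighted_td_loss(td_errors, weights):
    """Mean of IS-weighted squared TD errors for one sampled batch."""
    td_errors = np.asarray(td_errors, dtype=float)
    weights = np.asarray(weights, dtype=float)
    return float(np.mean(weights * td_errors ** 2))

# Sampling probabilities from Example 1, reused here as assumed input.
probs = np.array([0.25, 0.25, 0.5])

for step in (0, 500, 1000):
    beta = linear_beta(step, total_steps=1000)
    w = (len(probs) * probs) ** (-beta)
    w = w / w.max()  # max-normalization: every weight is at most 1
    print(step, round(beta, 2), w.round(3))
```

As beta grows toward 1, the weight on the over-sampled transition shrinks further, so its updates are scaled down more strongly, while the max-normalization guarantees that no update is ever scaled up.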
This problem ships 5 hidden tests:
- Probabilities are proportional to priorities (alpha=1)
- alpha=0 gives uniform probabilities
- Probabilities sum to 1
- IS weights example, normalized to max 1
- Higher-probability samples get smaller IS weights