Prioritized experience replayMedium

Prioritized experience replay

Background

Prioritized Experience Replay (PER) improves DQN-style training by replaying informative transitions more often. Instead of sampling the replay buffer uniformly, PER samples in proportion to each transition's TD-error magnitude (how "surprising" it was). To stay unbiased, samples are drawn from priorities raised to a power α\alpha, and the resulting bias is corrected by importance-sampling (IS) weights.

Problem statement

Implement per_probabilities(priorities, alpha) and is_weights(probs, N, beta):

P(i)=piαjpjα,wi=(NP(i))βmaxj(NP(j))βP(i) = \frac{p_i^{\alpha}}{\sum_j p_j^{\alpha}}, \qquad w_i = \frac{(N\cdot P(i))^{-\beta}}{\max_j (N\cdot P(j))^{-\beta}}

per_probabilities returns the sampling distribution; is_weights returns the normalized IS weights (divided by their max so the largest weight is 1).

Input

  • prioritiesnp.ndarray (N,), non-negative priorities (e.g. |TD-error| + eps).
  • alphafloat in [0, 1], prioritization strength (0 = uniform).
  • probsnp.ndarray (N,), the sampling probabilities from per_probabilities.
  • Nint, buffer size.
  • betafloat in [0, 1], IS correction strength.

Output

  • per_probabilitiesnp.ndarray (N,) summing to 1.
  • is_weightsnp.ndarray (N,) with max value 1.

Examples

Example 1

priorities = [1, 1, 2], alpha = 1.0
P = [0.25, 0.25, 0.5]            # proportional to priorities
is_weights(P, N=3, beta=1.0) -> [1.0, 1.0, 0.5]

Explanation: with α=1\alpha=1 probabilities are proportional to priorities. IS weights are (NP)β(N P)^{-\beta} normalized by their max; the highest-probability sample (index 2) gets the smallest weight.

Constraints

  • α=0\alpha=0 makes all probabilities equal (uniform sampling); higher α\alpha sharpens toward high-priority items.
  • IS weights use (NP)β(N\cdot P)^{-\beta}, then divide by the maximum so weights are in (0,1](0, 1].
  • Probabilities must sum to 1.

Notes

  • α\alpha and β\beta trade off: α\alpha controls how much prioritization happens; β\beta controls how fully its bias is corrected (often annealed β1\beta\to1 over training).
  • Normalizing IS weights by their max keeps gradient magnitudes bounded, only scaling updates down, which stabilizes training.
Python
Loading...

This problem ships 5 hidden tests. They run in your browser via Pyodide — no backend, no submission queue. Press ▶ Run tests to execute.

  • Probabilities are proportional to priorities (alpha=1)
  • alpha=0 gives uniform probabilities
  • Probabilities sum to 1
  • IS weights example, normalized to max 1
  • Higher-probability samples get smaller IS weights