Reservoir samplingMedium

Reservoir sampling

Background

Reservoir sampling picks k items uniformly at random from a stream whose length you do not know in advance and cannot fully store. Algorithm R keeps a k-slot "reservoir": the first k items fill it, then each later item (index i) replaces a random reservoir slot with probability k/(i+1). When the stream ends, every item has had an equal k/n chance of being in the reservoir.

Problem statement

Implement reservoir_sample(stream, k, seed=0) returning k sampled items:

  1. Fill the reservoir with the first k items.
  2. For each later index i (≥ k), draw j uniform in {0,...,i}; if j < k, set reservoir[j] = stream[i].

Use np.random.default_rng(seed). If the stream has ≤ k items, return all of them.

Input

  • stream — an iterable/sequence of items.
  • kint, reservoir size.
  • seedint, RNG seed.

Output

An np.ndarray of up to k sampled items (all elements if the stream is shorter than k).

Examples

Example 1

Input:  stream = [10, 20, 30, 40, 50], k = 2
Output: 2 items from the stream; over many seeds each item appears with probability 2/5

Explanation: the first 2 items seed the reservoir; items 3–5 each get a chance to replace a slot, balanced so all 5 items are equally likely to survive.

Constraints

  • The replacement index j is drawn from {0,...,i} inclusive; replace only when j < k.
  • One pass over the stream, storing at most k items.
  • If len(stream) <= k, return everything.

Notes

  • The magic is the shrinking acceptance probability k/(i+1): it exactly cancels the chance of later eviction so the final distribution is uniform.
  • It is the standard way to sample from logs, Kafka streams, or any data too large to hold in memory.
Python
Loading...

This problem ships 5 hidden tests. They run in your browser via Pyodide — no backend, no submission queue. Press ▶ Run tests to execute.

  • Returns k items, all drawn from the stream
  • Stream shorter than k returns everything
  • Samples are distinct (sampling without replacement)
  • Uniformity: each element survives with probability k/n
  • Reproducible with the same seed