## Approach

Seed a generator. Fill `reservoir` with the first $k$ items. For each subsequent index $i$ (starting at $k$), draw $j$ uniformly from $\{0, 1, \ldots, i\}$. If $j < k$, replace `reservoir[j]` with `stream[i]`. After the stream ends, return the reservoir as an array.

## Intuition

**The problem.** You have a stream of items (e.g., log entries flowing past) whose length $n$ you don't know in advance. You want $k$ items chosen uniformly at random from it. Constraint: you can only store $k$ items at a time (the stream is too big for memory).

**Algorithm R (Vitter, 1985).** A surprisingly elegant solution:

1. The first $k$ items go straight into the reservoir.
2. For each later item $i$ (0-indexed), generate a random $j \in \{0, \ldots, i\}$. If $j < k$, replace `reservoir[j]` with `stream[i]`.

That's it: a two-step algorithm. The magic is that, at the end, every item from the stream has probability exactly $k/n$ of being in the reservoir.

**Proof sketch.** Consider any item $x$ at position $i \geq k$ in the stream.

- Probability $x$ enters the reservoir on its turn: $k/(i+1)$. (We draw $j \in \{0, \ldots, i\}$, which has $i+1$ choices; we keep $x$ if $j < k$.)
- Probability $x$ survives all subsequent items (positions $i+1, i+2, \ldots, n-1$): at each step $m \geq i+1$, $x$ is overwritten only if the draw $j$ lands exactly on $x$'s slot, which is one outcome out of $m+1$. So the probability that $x$ is *not* overwritten at step $m$ is $1 - 1/(m+1) = m/(m+1)$.
- Product: $\prod_{m=i+1}^{n-1} m/(m+1) = (i+1)/n$ (telescoping).

So the probability that $x$ ends up in the reservoir is $k/(i+1) \cdot (i+1)/n = k/n$. Beautifully uniform.

Items in positions $< k$ (the initial fill) are in the reservoir with probability 1 after the fill, and then face the same per-step survival probability from step $k$ onward: $\prod_{m=k}^{n-1} m/(m+1) = k/n$. Same result.

**Why this is the right algorithm.**

- **One pass over the stream.** Can't go back; can't restart.
- **Bounded memory.** $O(k)$, independent of $n$.
- **Don't need to know $n$.** Works whether the stream has 1000 items or 1 trillion.

These properties make it the standard for sampling from streams: logs, sensor data, Kafka topics, anywhere data flows past you.

**The "shrinking acceptance" intuition.** Early items in the stream are easy to add (they just fill the slots). As the stream grows, the chance that a new item claims a slot shrinks as $k/(i+1)$. This balances out: early items get in easily but face many chances to be evicted; late items rarely get in but face few chances to be evicted. The product is exactly $k/n$ for every item.

**Without replacement.** Each slot holds a distinct stream position; the algorithm never samples the same position twice. (If the stream itself contains duplicate values, your sample can too, but each item-instance appears at most once in the reservoir.)

**The $j \in \{0, \ldots, i\}$ detail.** The inclusive upper bound $i$ matters. `rng.integers(0, i+1)` gives $\{0, 1, \ldots, i\}$, which is $i+1$ choices. If you used `rng.integers(0, i)` (giving $\{0, \ldots, i-1\}$), the acceptance probability would be $k/i$ instead of $k/(i+1)$, breaking the math.

**Use cases.**

- Online ML training: keep a uniform sample of the data seen so far without buffering it.
- Log analysis: random sample of HTTP requests.
- Distributed systems: each shard does reservoir sampling, then the results are aggregated.
- A/B testing: choose a subset of users for an experiment without knowing the total user count.

**Variant: Weighted reservoir sampling (A-Res).** Replace uniform sampling with priorities derived from weights, so heavier items are more likely to be included (for $k = 1$, the inclusion probability is exactly proportional to weight). Same one-pass property.

## Walkthrough

1. **Seed RNG.**

   ```python
   rng = np.random.default_rng(seed)
   ```

2. **Materialise stream as list.**

   ```python
   stream = list(stream)
   ```

   This handles generic iterables. With a true streaming source you would iterate without materialising, but for testing a list works.

3. **Initialise reservoir.**

   ```python
   reservoir = list(stream[:k])
   ```

   First $k$ items.

4. **Process subsequent items.**

   ```python
   for i in range(k, len(stream)):
       j = int(rng.integers(0, i + 1))
       if j < k:
           reservoir[j] = stream[i]
   ```

   - `rng.integers(0, i+1)` draws $j \in \{0, \ldots, i\}$ (`high` is exclusive in NumPy).
   - `int(...)` for clean indexing.
   - Replace only if $j$ falls into the reservoir's index range.

5. **Return.**

   ```python
   return np.array(reservoir)
   ```

## Complexity

- **Time:** $O(n)$: one pass over $n$ items, constant work per item.
- **Space:** $O(k)$ for the algorithm itself: just the reservoir. (The reference solution also materialises the stream with `list(stream)`, which costs $O(n)$; a true streaming version avoids that.)

This is *much* better than the alternative of buffering the entire stream and shuffling, which inherently needs $O(n)$ space.

## Reference solution

```python
import numpy as np


def reservoir_sample(stream, k, seed=0):
    rng = np.random.default_rng(seed)
    stream = list(stream)          # materialise so we can slice and index
    reservoir = list(stream[:k])   # initial fill: first k items
    for i in range(k, len(stream)):
        j = int(rng.integers(0, i + 1))  # uniform on {0, ..., i}; high is exclusive
        if j < k:
            reservoir[j] = stream[i]     # item i replaces slot j
    return np.array(reservoir)
```

## Common pitfalls

- **Wrong range for $j$.** `rng.integers(0, i + 1)` gives $\{0, \ldots, i\}$. `rng.integers(0, i)` gives $\{0, \ldots, i-1\}$. The off-by-one breaks uniformity.
- **Wrong acceptance probability.** The probability should be $k/(i+1)$, which is exactly "$j < k$ out of $i+1$ choices." Some implementations write it as "$\text{rand}() < k/(i+1)$". That gives the same acceptance probability, but it then needs a second draw to choose which slot to replace, and the float comparison is more exposed to floating-point issues.
- **Replacing in the wrong slot.** When $j < k$, replace `reservoir[j]`, *not* a fixed slot such as `reservoir[0]` or the oldest entry. Conditional on $j < k$, the drawn $j$ is uniform over the $k$ slots, and that is what gives every item already in the reservoir the same $1/(i+1)$ chance of eviction.
- **Forgetting initial fill.** The first $k$ items must go straight in; replacement only makes sense once there are $k$ slots to overwrite. The conditional "if $j < k$" applies only to items after the first $k$.
- **Looping over the whole stream from the start.** Indices $0$ to $k-1$ aren't subject to replacement logic; they're just slotted in. Start the loop at $i = k$.
- **Using `rng.uniform()` instead of `rng.integers()`.** `rng.uniform()` gives floats in $[0, 1)$. You can scale, as in `j = int(rng.uniform() * (i + 1))`, but this is slightly slower and invites floating-point edge cases (a result of exactly $i+1$ should not occur, but defensive code tends to guard against it anyway).
- **Returning the full stream when len > k.** That's not reservoir sampling, that's "return everything." The point is to sample $k$ items.
- **Not handling len ≤ k.** If the stream has no more than $k$ items, the loop never runs and we return the (possibly short) reservoir. This is correct almost by accident: it works because `stream[:k]` is the same as `stream` for short streams.
- **Buffering the full stream.** The reference materialises with `list(stream)` for simplicity. In a true streaming setting, you'd iterate without storing, using `for i, item in enumerate(stream):` and avoiding `len(stream)`.
- **Not using `int(rng.integers(...))`.** NumPy returns `np.int64`; using it as a list index works, but `int(...)` is cleaner and avoids any dtype surprises.
- **Implementing rejection sampling instead.** Mistakenly thinking you should reject items based on some criterion. Reservoir sampling is *replacement*-based, not rejection-based.
- **Trying to vectorise.** Hard to vectorise efficiently because of the data dependency: each step's outcome depends on the current reservoir state. Just loop; it's fast enough in Python for typical use.

## Variants & extensions

- **Algorithm L (Li).** Skip ahead: compute how many items to skip before the next replacement instead of drawing a random number for every item. Faster than Algorithm R for huge streams.
- **Weighted reservoir sampling (A-Res, Efraimidis-Spirakis).** Each item has a weight, and heavier items are more likely to be sampled. Each item gets a random key $u^{1/w}$ (with $u$ uniform on $(0, 1)$ and $w$ its weight) and the $k$ largest keys are kept; this is closely related to exponential clocks.
- **Bias-corrected reservoir sampling.** When items can be repeated or have varying durations, adjust to keep uniformity in time/density.
- **Distributed reservoir sampling.** Each shard runs its own reservoir; combine them with weighted sampling proportional to shard size.
- **Reservoir sampling with deletions.** Trickier; some algorithms exist but they're more involved.
- **Min-hash sketches.** Different sketching primitive; gives Jaccard similarity estimates.
- **Count-min sketches.** Frequency estimation; not sampling.
- **Bloom filters.** Membership testing without sampling; different problem.
- **Why this matters in ML.**
  - **Training data sampling.** Pick random rows from huge datasets without loading them all.
  - **Streaming RL.** Sample experience replay buffer entries without knowing total trajectory length.
  - **Online evaluation.** Maintain a uniform sample of the predictions seen so far for monitoring.
- **Sampling without replacement properties.** Each element appears at most once; for $k = n$, the algorithm returns the whole stream.
- **Memory-bounded sampling.** Reservoir sampling is the canonical algorithm for "$k$ items, $O(k)$ memory, one pass." Streaming RAG, online statistics, etc.
- **Connection to importance sampling.** Reservoir sampling is uniform; weighted variants are importance sampling adapted to streams.
- **Streaming map-reduce.** Reservoir sampling parallelises naturally: each worker maintains a reservoir, and the reservoirs are then combined (weighted by how many items each worker saw).
- **Empirical validity check.** The uniformity test in this problem runs 60000 trials and checks that empirical frequencies converge to $k/n = 0.4$ within tolerance. A great test of correctness.
- **Vitter's 1985 paper.** Jeffrey Vitter, "Random Sampling with a Reservoir." It presents Algorithm R (which Vitter credits to Alan Waterman) alongside the faster skip-based Algorithms X, Y and Z. Algorithm L came later (Li, 1994).
- **In production.** Reservoir-style sampling shows up in stream-analytics tooling around systems such as Kafka, Spark, BigQuery and Snowflake, though the exact sampling method varies by system.
- **Why this is "Medium" difficulty.** The algorithm is short (5 lines), but the proof of correctness is non-trivial. Off-by-one errors in the $j$ range are a common stumbling block, and the `stream[:k]` initial fill is easy to forget.
- **Generalisations.** Skip ahead, weighted, biased toward freshness, etc. The standard variant (this problem) is uniform sampling without replacement on a one-pass stream.
- **The "infinite stream" case.** Reservoir sampling works for unbounded streams: at any point, the reservoir is a uniform sample from "everything seen so far." Stop whenever.
- **Why not "save everything, shuffle, take $k$."** Trivially correct, but requires $O(n)$ space. Reservoir sampling uses $O(k)$, and often $k \ll n$.
- **Why this is taught.** It captures the essence of streaming algorithms: bounded memory, one pass, probabilistic guarantees. It is a foundation for understanding sketching, sampling, and online algorithms.
- **The "selection problem" connection.** Reservoir sampling is related to the more general "select $k$ items from a stream with property X." Each variation has its own algorithm; reservoir sampling is the uniform case.
- **Common production gotcha.** When the stream is empty or very short ($< k$), behaviour must be well-defined. Returning whatever fits in the reservoir is the standard convention.
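
The "true streaming" form mentioned under Common pitfalls (iterate with `enumerate`, never call `len(stream)`) could be sketched roughly as follows. This is a minimal illustration, not the reference solution: the name `reservoir_sample_stream` is hypothetical, and it uses the standard library's `random.Random` in place of NumPy's generator so that it runs without dependencies.

```python
import random
from typing import Iterable, List, TypeVar

T = TypeVar("T")


def reservoir_sample_stream(stream: Iterable[T], k: int, seed: int = 0) -> List[T]:
    """One-pass Algorithm R over any iterable; never calls len() or indexes the stream."""
    rng = random.Random(seed)  # assumption: stdlib RNG swapped in for np.random.default_rng
    reservoir: List[T] = []
    for i, item in enumerate(stream):
        if i < k:
            # Initial fill: the first k items go straight in.
            reservoir.append(item)
        else:
            # randrange(i + 1) draws j uniformly from {0, ..., i}.
            j = rng.randrange(i + 1)
            if j < k:
                reservoir[j] = item
    return reservoir


# Usage: sample 3 squares from a generator whose length is never queried.
sample = reservoir_sample_stream((x * x for x in range(1000)), k=3, seed=42)
print(sample)
```

Only the reservoir is held in memory, so the $O(k)$ space bound applies to the whole function, and a stream shorter than $k$ simply yields a shorter list.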
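
The $k/n$ inclusion probability derived in the proof sketch can also be checked empirically. The sketch below is one way such a check might look; it is not the problem's actual test harness. The helper `inclusion_frequencies`, the choice $n = 5$, $k = 2$ (which gives $k/n = 0.4$), the 20000 trials, and the use of stdlib `random` instead of NumPy are all assumptions made for illustration.

```python
import random
from collections import Counter


def reservoir_sample(stream, k, rng):
    """Algorithm R on a list, using a caller-supplied random.Random instance."""
    reservoir = list(stream[:k])
    for i in range(k, len(stream)):
        j = rng.randrange(i + 1)  # uniform on {0, ..., i}
        if j < k:
            reservoir[j] = stream[i]
    return reservoir


def inclusion_frequencies(n, k, trials, seed=0):
    """Fraction of trials in which each item 0..n-1 ends up in the sample."""
    rng = random.Random(seed)  # one generator shared across all trials
    stream = list(range(n))
    counts = Counter()
    for _ in range(trials):
        counts.update(reservoir_sample(stream, k, rng))
    return [counts[x] / trials for x in range(n)]


freqs = inclusion_frequencies(n=5, k=2, trials=20_000)
print([round(f, 3) for f in freqs])  # every entry should be close to k/n = 0.4
```

Changing `rng.randrange(i + 1)` to `rng.randrange(i)` in this sketch is a quick way to see the off-by-one pitfall: the frequencies then drift away from $k/n$.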
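
For the weighted variant (A-Res) listed above, a minimal sketch under stated assumptions: each item receives the key $u^{1/w}$ with $u$ uniform on $[0, 1)$, and a min-heap keeps the $k$ largest keys. The function name `weighted_reservoir_sample`, its signature, and the requirement that weights be strictly positive are illustrative choices, not part of the original problem.

```python
import heapq
import random


def weighted_reservoir_sample(items, weights, k, seed=0):
    """A-Res sketch: keep the k items with the largest keys u ** (1 / w)."""
    rng = random.Random(seed)
    heap = []  # min-heap of (key, index, item); heap[0] is the smallest kept key
    for idx, (item, w) in enumerate(zip(items, weights)):
        if w <= 0:
            raise ValueError("weights must be positive")
        key = rng.random() ** (1.0 / w)
        entry = (key, idx, item)  # idx breaks ties so items are never compared
        if len(heap) < k:
            heapq.heappush(heap, entry)
        elif heap and key > heap[0][0]:
            # New key beats the weakest kept key: evict it in one heap operation.
            heapq.heapreplace(heap, entry)
    return [item for _, _, item in heap]


# Usage: "heavy" carries most of the total weight, so it is usually selected.
print(weighted_reservoir_sample(["heavy", "a", "b", "c"], [100, 1, 1, 1], k=2, seed=7))
```

Like Algorithm R, this is one pass with $O(k)$ memory; the heap adds an $O(\log k)$ factor per accepted item. The returned list is in heap order, not stream order.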