Speculative decoding verification
Background
Speculative decoding speeds up LLM generation by letting a small, fast draft model propose several tokens, which the large target model then verifies in a single parallel forward pass. The clever part is the acceptance test: it accepts or rejects each drafted token in a way that provably yields samples from the target distribution — so the output is identical in distribution to plain target sampling, just faster.
Problem statement
Implement speculative_accept(draft_tokens, p, q, r) returning the list of accepted draft tokens (the longest accepted prefix). For each drafted token :
where is the target distribution and the draft distribution at step . Stop at the first rejection and return the accepted prefix.
Input
draft_tokens— list ofinttoken ids proposed by the draft model, lengthL.p—np.ndarray(L, vocab), target probabilities at each step.q—np.ndarray(L, vocab), draft probabilities at each step.r—np.ndarray(L,), uniform random numbers in[0, 1)for the accept tests.
Output
A list[int]: the accepted prefix of draft_tokens (possibly empty, up to all L).
Examples
Example 1
draft_tokens = [5, 2, 7]
p[0,5]=0.6, q[0,5]=0.3 ratio min(1, 2.0)=1 -> r[0]=0.5 <= 1 accept
p[1,2]=0.1, q[1,2]=0.5 ratio min(1, 0.2) -> r[1]=0.5 > 0.2 reject
Output: [5]
Explanation: token 5 is accepted (the target likes it at least as much as the draft, ratio capped at 1). Token 2 is rejected because the target's probability is far below the draft's, and verification halts there.
Constraints
- The acceptance ratio is capped at 1: a token the target prefers more than the draft is always accepted (for ).
- Stop at the first rejected token; do not consider later tokens.
- Return the accepted prefix as a list (empty if the first token is rejected).
Notes
- On rejection the algorithm would resample the next token from the normalized residual ; this problem isolates the acceptance test that determines the speed-up.
- The expected number of accepted tokens per draft block governs the wall-clock speed-up — the closer the draft is to the target, the more tokens stick.
This problem ships 5 hidden tests. They run in your browser via Pyodide — no backend, no submission queue. Press ▶ Run tests to execute.
- •Reference example: second token rejected
- •All tokens accepted when target dominates draft
- •First token rejected gives empty prefix
- •Ratio capped at 1 (p >= q always accepts for r <= 1)
- •Acceptance halts at the first rejection mid-sequence