Speculative decoding verificationMedium

Speculative decoding verification

Background

Speculative decoding speeds up LLM generation by letting a small, fast draft model propose several tokens, which the large target model then verifies in a single parallel forward pass. The clever part is the acceptance test: it accepts or rejects each drafted token in a way that provably yields samples from the target distribution — so the output is identical in distribution to plain target sampling, just faster.

Problem statement

Implement speculative_accept(draft_tokens, p, q, r) returning the list of accepted draft tokens (the longest accepted prefix). For each drafted token tit_i:

accept if rimin ⁣(1, pi(ti)qi(ti))\text{accept if } \quad r_i \le \min\!\Big(1,\ \frac{p_i(t_i)}{q_i(t_i)}\Big)

where pip_i is the target distribution and qiq_i the draft distribution at step ii. Stop at the first rejection and return the accepted prefix.

Input

  • draft_tokens — list of int token ids proposed by the draft model, length L.
  • pnp.ndarray (L, vocab), target probabilities at each step.
  • qnp.ndarray (L, vocab), draft probabilities at each step.
  • rnp.ndarray (L,), uniform random numbers in [0, 1) for the accept tests.

Output

A list[int]: the accepted prefix of draft_tokens (possibly empty, up to all L).

Examples

Example 1

draft_tokens = [5, 2, 7]
p[0,5]=0.6, q[0,5]=0.3   ratio min(1, 2.0)=1  -> r[0]=0.5 <= 1  accept
p[1,2]=0.1, q[1,2]=0.5   ratio min(1, 0.2)    -> r[1]=0.5 > 0.2 reject
Output: [5]

Explanation: token 5 is accepted (the target likes it at least as much as the draft, ratio capped at 1). Token 2 is rejected because the target's probability is far below the draft's, and verification halts there.

Constraints

  • The acceptance ratio is capped at 1: a token the target prefers more than the draft is always accepted (for ri1r_i \le 1).
  • Stop at the first rejected token; do not consider later tokens.
  • Return the accepted prefix as a list (empty if the first token is rejected).

Notes

  • On rejection the algorithm would resample the next token from the normalized residual (piqi)+(p_i - q_i)_+; this problem isolates the acceptance test that determines the speed-up.
  • The expected number of accepted tokens per draft block governs the wall-clock speed-up — the closer the draft is to the target, the more tokens stick.
Python
Loading...

This problem ships 5 hidden tests. They run in your browser via Pyodide — no backend, no submission queue. Press ▶ Run tests to execute.

  • Reference example: second token rejected
  • All tokens accepted when target dominates draft
  • First token rejected gives empty prefix
  • Ratio capped at 1 (p >= q always accepts for r <= 1)
  • Acceptance halts at the first rejection mid-sequence