BPE: apply merges
Background
Byte-pair encoding (BPE) is the tokeniser behind GPT-2 and most modern language models. Training a BPE tokeniser learns an ordered list of merges — adjacent symbol pairs to glue together, most-frequent first. Inference takes a string already split into single-character (or single-byte) tokens and repeatedly applies those merges until none fit, producing the final token sequence the model actually sees. This problem is the inference half: given the learned merges, apply them. (Learning the merges is a separate problem.)
Problem statement
Implement apply_merges(tokens, merges). A merge (a, b) means "wherever token a is immediately followed by token b, replace the pair with the single token a + b." The merges are given in priority order — earlier in the list means higher priority (its index is its rank; rank 0 wins). Repeat:
- Scan the current token list for every adjacent pair.
- Among the adjacent pairs that appear in
merges, pick the one with the lowest rank (highest priority). If none appear, stop. - Merge that pair into one token, then go back to step 1 on the updated list.
The subtlety is step 2: you re-scan the evolving token list each iteration and pick the globally highest-priority pair — you do not walk the merges list in order applying each one fully. One merge can create the pair that a later merge needs.
Input
tokens—list[str], the starting tokens, e.g.["l", "o", "w", "</w>"].merges—list[tuple[str, str]], the merge rules in priority order (index0= highest priority).
Output
Returns a list[str]: the token list after every applicable merge has fired and no adjacent pair remains in merges.
Examples
Example 1 — cascading merges collapse to one token
Input: tokens=["l", "o", "w"], merges=[("l", "o"), ("lo", "w")]
Output: ["low"]
Explanation: ("l", "o") (rank 0) fires → ["lo", "w"]. Now ("lo", "w") (rank 1) is adjacent and in merges, so it fires → ["low"]. No adjacent pair remains, so stop.
Example 2 — priority decides, and changes the outcome
Input: tokens=["l", "o", "w"], merges=[("l", "o"), ("o", "w")]
Output: ["lo", "w"]
Explanation: both ("l", "o") and ("o", "w") are adjacent. ("l", "o") has the lower rank, so it wins → ["lo", "w"]. The only remaining pair is ("lo", "w"), which is not in merges, so we stop. Applying ("o", "w") first would have given the wrong ["l", "ow"].
Example 3 — a merge enables a later one
Input: tokens=["e", "r", "</w>"], merges=[("e", "r"), ("er", "</w>")]
Output: ["er</w>"]
Explanation: ("e", "r") fires → ["er", "</w>"], which creates the pair ("er", "</w>") that the rank-1 merge then consumes → ["er</w>"].
Constraints
- A merge's priority is its index in
merges(earlier = higher priority); ranks are unique per pair. - Each iteration re-scans the current token list and applies the single globally-highest-priority adjacent pair — not the merges in list order.
- If the highest-priority pair occurs more than once, merge the leftmost occurrence first (then re-scan).
- Stop when no adjacent pair appears in
merges. An emptymergeslist, a single-token input, or tokens with no matching pair are all no-ops that return the input unchanged. - Tokens are arbitrary strings; merging concatenates them (
a + b), so end-of-word markers like"</w>"are treated as ordinary tokens.
Notes
- Why not list order? Walking
mergesand applying each fully (all(l,o), then all(lo,w), …) breaks when the highest-priority applicable pair changes as the token list evolves — exactly the case Example 2 and Example 3 probe. - A clean implementation maps each pair to its rank once (
{pair: i}) and, on every pass, selects the minimum-rank adjacent pair — the same greedy, priority-driven loop GPT-2's tokeniser runs.
This problem ships 8 hidden tests. They run in your browser via Pyodide — no backend, no submission queue. Press ▶ Run tests to execute.
- •No merges → input unchanged
- •Single merge applies once
- •Cascading merges produce a single token
- •Priority is respected — earlier merge fires first
- •Multiple non-overlapping merges in one input
- •Merge that creates a new merge candidate
- •No applicable merge stops the loop cleanly
- •Single-token input is a no-op