BPE: apply mergesMedium

BPE: apply merges

Background

Byte-pair encoding (BPE) is the tokeniser behind GPT-2 and most modern language models. Training a BPE tokeniser learns an ordered list of merges — adjacent symbol pairs to glue together, most-frequent first. Inference takes a string already split into single-character (or single-byte) tokens and repeatedly applies those merges until none fit, producing the final token sequence the model actually sees. This problem is the inference half: given the learned merges, apply them. (Learning the merges is a separate problem.)

Problem statement

Implement apply_merges(tokens, merges). A merge (a, b) means "wherever token a is immediately followed by token b, replace the pair with the single token a + b." The merges are given in priority order — earlier in the list means higher priority (its index is its rank; rank 0 wins). Repeat:

  1. Scan the current token list for every adjacent pair.
  2. Among the adjacent pairs that appear in merges, pick the one with the lowest rank (highest priority). If none appear, stop.
  3. Merge that pair into one token, then go back to step 1 on the updated list.

The subtlety is step 2: you re-scan the evolving token list each iteration and pick the globally highest-priority pair — you do not walk the merges list in order applying each one fully. One merge can create the pair that a later merge needs.

Input

  • tokenslist[str], the starting tokens, e.g. ["l", "o", "w", "</w>"].
  • mergeslist[tuple[str, str]], the merge rules in priority order (index 0 = highest priority).

Output

Returns a list[str]: the token list after every applicable merge has fired and no adjacent pair remains in merges.

Examples

Example 1 — cascading merges collapse to one token

Input:  tokens=["l", "o", "w"], merges=[("l", "o"), ("lo", "w")]
Output: ["low"]

Explanation: ("l", "o") (rank 0) fires → ["lo", "w"]. Now ("lo", "w") (rank 1) is adjacent and in merges, so it fires → ["low"]. No adjacent pair remains, so stop.

Example 2 — priority decides, and changes the outcome

Input:  tokens=["l", "o", "w"], merges=[("l", "o"), ("o", "w")]
Output: ["lo", "w"]

Explanation: both ("l", "o") and ("o", "w") are adjacent. ("l", "o") has the lower rank, so it wins → ["lo", "w"]. The only remaining pair is ("lo", "w"), which is not in merges, so we stop. Applying ("o", "w") first would have given the wrong ["l", "ow"].

Example 3 — a merge enables a later one

Input:  tokens=["e", "r", "</w>"], merges=[("e", "r"), ("er", "</w>")]
Output: ["er</w>"]

Explanation: ("e", "r") fires → ["er", "</w>"], which creates the pair ("er", "</w>") that the rank-1 merge then consumes → ["er</w>"].

Constraints

  • A merge's priority is its index in merges (earlier = higher priority); ranks are unique per pair.
  • Each iteration re-scans the current token list and applies the single globally-highest-priority adjacent pair — not the merges in list order.
  • If the highest-priority pair occurs more than once, merge the leftmost occurrence first (then re-scan).
  • Stop when no adjacent pair appears in merges. An empty merges list, a single-token input, or tokens with no matching pair are all no-ops that return the input unchanged.
  • Tokens are arbitrary strings; merging concatenates them (a + b), so end-of-word markers like "</w>" are treated as ordinary tokens.

Notes

  • Why not list order? Walking merges and applying each fully (all (l,o), then all (lo,w), …) breaks when the highest-priority applicable pair changes as the token list evolves — exactly the case Example 2 and Example 3 probe.
  • A clean implementation maps each pair to its rank once ({pair: i}) and, on every pass, selects the minimum-rank adjacent pair — the same greedy, priority-driven loop GPT-2's tokeniser runs.
Python
Loading...

This problem ships 8 hidden tests. They run in your browser via Pyodide — no backend, no submission queue. Press ▶ Run tests to execute.

  • No merges → input unchanged
  • Single merge applies once
  • Cascading merges produce a single token
  • Priority is respected — earlier merge fires first
  • Multiple non-overlapping merges in one input
  • Merge that creates a new merge candidate
  • No applicable merge stops the loop cleanly
  • Single-token input is a no-op