Train BPE merges from a corpus
Background
This is the training half of Byte-Pair Encoding — the algorithm GPT-2 used to build its 50,257-token vocabulary, and a classic "implement BPE" interview question. Training learns an ordered list of merges from a corpus: repeatedly find the most frequent adjacent token pair and glue it into a new token. It is the companion to applying merges (encoding new text): train once to freeze the merge list, then apply that list forever after.
Problem statement
Implement train_bpe(corpus, num_merges). The corpus is a list of words, each a list of base tokens (single characters/bytes). Repeat up to num_merges rounds:
- Count every adjacent pair of tokens across the whole corpus.
- Pick the most frequent pair. On a tie, break it by lexicographic order on the pair tuple — i.e. choose the smallest pair.
- Merge that pair everywhere: replace each adjacent
[a, b]with the single tokena + bin every word. - Record the merged pair.
Stop after num_merges rounds, or earlier if no adjacent pairs remain. Return the learned merges in the order they were chosen.
Input
corpus—list[list[str]]: a list of words, each a list of base tokens. The function may mutate this corpus.num_merges—int: the maximum number of merges to learn.
Output
Returns list[tuple[str, str]] — the learned merges, in selection order. This is exactly the merges list that the encoder (apply-merges) consumes.
Examples
Example 1 — three rounds, with a tie in round 2
Input: corpus=[["a", "b", "c", "a", "b"]], num_merges=3
Output: [("a", "b"), ("ab", "c"), ("abc", "ab")]
Explanation, round by round:
| Round | Pair counts | Winner | Corpus after |
|---|---|---|---|
| 1 | (a,b)=2, (b,c)=1, (c,a)=1 | (a,b) (most frequent) | [["ab","c","ab"]] |
| 2 | (ab,c)=1, (c,ab)=1 | (ab,c) (tie → lex: "ab" < "c") | [["abc","ab"]] |
| 3 | (abc,ab)=1 | (abc,ab) | [["abcab"]] |
Example 2 — lexicographic tie-breaking
Input: corpus=[["a", "b"], ["a", "c"]], num_merges=1
Output: [("a", "b")]
Explanation: (a,b) and (a,c) each occur once. The tie breaks lexicographically and ("a","b") < ("a","c"), so (a,b) wins.
Example 3 — frequency beats lexicographic order
Input: corpus=[["a", "b"], ["x", "y", "x", "y", "x", "y"]], num_merges=1
Output: [("x", "y")]
Explanation: (a,b) occurs once but (x,y) occurs three times. Most-frequent wins outright; lex order is only a tie-breaker, so the rarer-but-earlier (a,b) does not win.
Constraints
- Pair counts are taken across the entire corpus afresh each round (the counts change after every merge).
- Selection rule: maximise frequency; among pairs tied for the max, choose the lexicographically smallest
(a, b)tuple (deterministic — equivalent tominover the tied pairs). - Apply the merge with a single left-to-right pass per word (
word[i:i+2] == (a, b)→ emita+b, advance 2; else emitword[i], advance 1). Do not usestr.replace, which ignores token boundaries and corrupts results. - Stop after
num_mergesrounds or as soon as no adjacent pair exists (e.g. all words are single tokens) — then return however many merges were learned (possibly an empty list).
Notes
- Why a fixed tie-break?
Counter.most_common(1)breaks ties by insertion order, which is corpus-dependent and not reproducible across implementations. Freezing lexicographic order makes the trained vocabulary deterministic. - Companion problem. The output here is the input to BPE encoding — train to freeze the ordered merge list, then feed it to the apply-merges step to tokenise any new text.
This problem ships 6 hidden tests. They run in your browser via Pyodide — no backend, no submission queue. Press ▶ Run tests to execute.
- •Returns a list of merges with the expected length
- •Diagnostic: simple corpus produces expected merge sequence
- •Stops early when no adjacent pairs remain
- •Lexicographic tie-breaking on pair tuples
- •Picks the most-frequent pair (vs a more-lexicographically-early but rarer one)
- •Realistic corpus: 'low'/'lower'/'newer' produces sensible merges