Train BPE merges from a corpusHard

Train BPE merges from a corpus

Background

This is the training half of Byte-Pair Encoding — the algorithm GPT-2 used to build its 50,257-token vocabulary, and a classic "implement BPE" interview question. Training learns an ordered list of merges from a corpus: repeatedly find the most frequent adjacent token pair and glue it into a new token. It is the companion to applying merges (encoding new text): train once to freeze the merge list, then apply that list forever after.

Problem statement

Implement train_bpe(corpus, num_merges). The corpus is a list of words, each a list of base tokens (single characters/bytes). Repeat up to num_merges rounds:

  1. Count every adjacent pair of tokens across the whole corpus.
  2. Pick the most frequent pair. On a tie, break it by lexicographic order on the pair tuple (a,b)(a, b) — i.e. choose the smallest pair.
  3. Merge that pair everywhere: replace each adjacent [a, b] with the single token a + b in every word.
  4. Record the merged pair.

Stop after num_merges rounds, or earlier if no adjacent pairs remain. Return the learned merges in the order they were chosen.

Input

  • corpuslist[list[str]]: a list of words, each a list of base tokens. The function may mutate this corpus.
  • num_mergesint: the maximum number of merges to learn.

Output

Returns list[tuple[str, str]] — the learned merges, in selection order. This is exactly the merges list that the encoder (apply-merges) consumes.

Examples

Example 1 — three rounds, with a tie in round 2

Input:  corpus=[["a", "b", "c", "a", "b"]], num_merges=3
Output: [("a", "b"), ("ab", "c"), ("abc", "ab")]

Explanation, round by round:

RoundPair countsWinnerCorpus after
1(a,b)=2, (b,c)=1, (c,a)=1(a,b) (most frequent)[["ab","c","ab"]]
2(ab,c)=1, (c,ab)=1(ab,c) (tie → lex: "ab" < "c")[["abc","ab"]]
3(abc,ab)=1(abc,ab)[["abcab"]]

Example 2 — lexicographic tie-breaking

Input:  corpus=[["a", "b"], ["a", "c"]], num_merges=1
Output: [("a", "b")]

Explanation: (a,b) and (a,c) each occur once. The tie breaks lexicographically and ("a","b") < ("a","c"), so (a,b) wins.

Example 3 — frequency beats lexicographic order

Input:  corpus=[["a", "b"], ["x", "y", "x", "y", "x", "y"]], num_merges=1
Output: [("x", "y")]

Explanation: (a,b) occurs once but (x,y) occurs three times. Most-frequent wins outright; lex order is only a tie-breaker, so the rarer-but-earlier (a,b) does not win.

Constraints

  • Pair counts are taken across the entire corpus afresh each round (the counts change after every merge).
  • Selection rule: maximise frequency; among pairs tied for the max, choose the lexicographically smallest (a, b) tuple (deterministic — equivalent to min over the tied pairs).
  • Apply the merge with a single left-to-right pass per word (word[i:i+2] == (a, b) → emit a+b, advance 2; else emit word[i], advance 1). Do not use str.replace, which ignores token boundaries and corrupts results.
  • Stop after num_merges rounds or as soon as no adjacent pair exists (e.g. all words are single tokens) — then return however many merges were learned (possibly an empty list).

Notes

  • Why a fixed tie-break? Counter.most_common(1) breaks ties by insertion order, which is corpus-dependent and not reproducible across implementations. Freezing lexicographic order makes the trained vocabulary deterministic.
  • Companion problem. The output here is the input to BPE encoding — train to freeze the ordered merge list, then feed it to the apply-merges step to tokenise any new text.
Python
Loading...

This problem ships 6 hidden tests. They run in your browser via Pyodide — no backend, no submission queue. Press ▶ Run tests to execute.

  • Returns a list of merges with the expected length
  • Diagnostic: simple corpus produces expected merge sequence
  • Stops early when no adjacent pairs remain
  • Lexicographic tie-breaking on pair tuples
  • Picks the most-frequent pair (vs a more-lexicographically-early but rarer one)
  • Realistic corpus: 'low'/'lower'/'newer' produces sensible merges