Byte-level UTF-8 tokenizer
Background
The simplest tokenizer that works for any unicode text: encode the string as UTF-8 and treat each byte as its own token. Because UTF-8 bytes are always in , the vocabulary is exactly 256 entries — no "unknown token" is ever possible. This is the foundation GPT-2's byte-level BPE is built on: the BPE merges sit on top of this byte tokenizer, so even text the BPE vocabulary has never seen still survives an encode/decode round-trip through its raw bytes.
Problem statement
Implement two inverse functions:
byte_encode(text)— UTF-8-encode the string and return the bytes as a list of ints (each in ).byte_decode(ids)— the inverse: turn a list of byte ids back into the string. Decode witherrors='replace'so a truncated byte sequence falls back to the replacement character instead of raising.
Input
byte_encode(text)—text:str, arbitrary unicode (English, emoji, CJK, anything).byte_decode(ids)—ids:list[int], each in .
Output
byte_encodereturnslist[int]— the UTF-8 bytes as integers.byte_decodereturnsstr— the reconstructed text (as close as the byte sequence allows).
Examples
Example 1 — ASCII round-trip
byte_encode("Hi") -> [72, 105] # the ord() values of 'H', 'i'
byte_decode([72, 105]) -> "Hi"
Explanation: ASCII characters are a single UTF-8 byte each, equal to their code point, so encode/decode is a clean round-trip.
Example 2 — emoji takes four bytes
byte_encode("\U0001F600") -> [0xF0, 0x9F, 0x98, 0x80] # = [240, 159, 152, 128]
byte_decode([240, 159, 152, 128]) -> "\U0001F600" # 😀
Explanation: the grinning-face emoji U+1F600 is 4 bytes in UTF-8. The encoder must emit all four; the decoder must reassemble them into the single character. (CJK characters like 中 are 3 bytes each, so "中文" → 6 ids.)
Constraints
- Every output id is in — the vocabulary size is exactly 256.
- ASCII bytes equal Python's
ord()values; emoji are 4 UTF-8 bytes, CJK characters 3. byte_decodemust useerrors='replace'so a truncated multi-byte sequence (common with sliding-window LM inference) does not raise.- The empty string encodes to
[]andbyte_decode([])returns"".
Notes
- Why byte-level. It is universal — any string round-trips — but verbose: every character costs 1–4 tokens. Subword BPE compresses common byte sequences into single tokens (GPT-2 averages ~3.5 chars/token on English), trading the 256-entry simplicity for a ~50K vocabulary.
- Related. This is the byte layer beneath the BPE apply-merges and BPE train-merges problems.
This problem ships 6 hidden tests. They run in your browser via Pyodide — no backend, no submission queue. Press ▶ Run tests to execute.
- •ASCII roundtrips: encode then decode returns the original string
- •ASCII byte ids equal Python's ord() values
- •All byte ids are in [0, 256)
- •Diagnostic: emoji uses 4 bytes (UTF-8 surrogate pair encoding)
- •Multi-byte CJK characters: roundtrip works
- •Empty string: encodes to empty list, decodes back to empty string