Byte-level UTF-8 tokenizerMedium

Byte-level UTF-8 tokenizer

Background

The simplest tokenizer that works for any unicode text: encode the string as UTF-8 and treat each byte as its own token. Because UTF-8 bytes are always in [0,256)[0, 256), the vocabulary is exactly 256 entries — no "unknown token" is ever possible. This is the foundation GPT-2's byte-level BPE is built on: the BPE merges sit on top of this byte tokenizer, so even text the BPE vocabulary has never seen still survives an encode/decode round-trip through its raw bytes.

Problem statement

Implement two inverse functions:

  1. byte_encode(text) — UTF-8-encode the string and return the bytes as a list of ints (each in [0,256)[0, 256)).
  2. byte_decode(ids) — the inverse: turn a list of byte ids back into the string. Decode with errors='replace' so a truncated byte sequence falls back to the replacement character instead of raising.

Input

  • byte_encode(text)text: str, arbitrary unicode (English, emoji, CJK, anything).
  • byte_decode(ids)ids: list[int], each in [0,256)[0, 256).

Output

  • byte_encode returns list[int] — the UTF-8 bytes as integers.
  • byte_decode returns str — the reconstructed text (as close as the byte sequence allows).

Examples

Example 1 — ASCII round-trip

byte_encode("Hi")        -> [72, 105]      # the ord() values of 'H', 'i'
byte_decode([72, 105])   -> "Hi"

Explanation: ASCII characters are a single UTF-8 byte each, equal to their code point, so encode/decode is a clean round-trip.

Example 2 — emoji takes four bytes

byte_encode("\U0001F600")  -> [0xF0, 0x9F, 0x98, 0x80]   # = [240, 159, 152, 128]
byte_decode([240, 159, 152, 128]) -> "\U0001F600"        # 😀

Explanation: the grinning-face emoji U+1F600 is 4 bytes in UTF-8. The encoder must emit all four; the decoder must reassemble them into the single character. (CJK characters like are 3 bytes each, so "中文" → 6 ids.)

Constraints

  • Every output id is in [0,256)[0, 256) — the vocabulary size is exactly 256.
  • ASCII bytes equal Python's ord() values; emoji are 4 UTF-8 bytes, CJK characters 3.
  • byte_decode must use errors='replace' so a truncated multi-byte sequence (common with sliding-window LM inference) does not raise.
  • The empty string encodes to [] and byte_decode([]) returns "".

Notes

  • Why byte-level. It is universal — any string round-trips — but verbose: every character costs 1–4 tokens. Subword BPE compresses common byte sequences into single tokens (GPT-2 averages ~3.5 chars/token on English), trading the 256-entry simplicity for a ~50K vocabulary.
  • Related. This is the byte layer beneath the BPE apply-merges and BPE train-merges problems.
Python
Loading...

This problem ships 6 hidden tests. They run in your browser via Pyodide — no backend, no submission queue. Press ▶ Run tests to execute.

  • ASCII roundtrips: encode then decode returns the original string
  • ASCII byte ids equal Python's ord() values
  • All byte ids are in [0, 256)
  • Diagnostic: emoji uses 4 bytes (UTF-8 surrogate pair encoding)
  • Multi-byte CJK characters: roundtrip works
  • Empty string: encodes to empty list, decodes back to empty string