Token embedding lookup
Background
Token embedding is the very first thing a transformer does: it turns a sequence of integer token ids into a sequence of learnable vectors. The embedding table is one big matrix with one row per vocabulary entry, and it is huge: about a third of GPT-2's 124M parameters live in this single table. The operation itself is the simplest in the whole model: a row lookup. The point of this problem is to recognise that you do not need a Python loop over the batch; NumPy fancy indexing does it in one operation.
Problem statement
Implement token_embedding(idx, weight): for every token id in idx, return the corresponding row of the embedding table. With idx of shape (B, T) and weight of shape (V, C):
out[b, t, :] = weight[idx[b, t], :]
In NumPy this is exactly weight[idx]: indexing the first axis with the (B, T) integer array returns one row per index, so the result has shape idx.shape + weight.shape[1:] = (B, T, C).
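A minimal sketch of the lookup, checked against an explicit double loop. The loop version (token_embedding_loop), the random seed, and the small sizes V=10, C=4, B=2, T=3 are illustrative assumptions, not part of the problem.

```python
import numpy as np


def token_embedding(idx: np.ndarray, weight: np.ndarray) -> np.ndarray:
    """Return one row of `weight` per token id in `idx`."""
    idx = np.asarray(idx)
    # Integer-array indexing on axis 0: result shape is idx.shape + weight.shape[1:].
    return weight[idx]


def token_embedding_loop(idx: np.ndarray, weight: np.ndarray) -> np.ndarray:
    """Reference implementation with Python loops, for comparison only."""
    B, T = idx.shape
    C = weight.shape[1]
    out = np.empty((B, T, C), dtype=weight.dtype)
    for b in range(B):
        for t in range(T):
            out[b, t] = weight[idx[b, t]]
    return out


rng = np.random.default_rng(0)
weight = rng.standard_normal((10, 4))    # V=10, C=4 (illustrative sizes)
idx = rng.integers(0, 10, size=(2, 3))   # B=2, T=3

out = token_embedding(idx, weight)
assert out.shape == (2, 3, 4)
assert np.array_equal(out, token_embedding_loop(idx, weight))
```

Both versions produce the same array; the vectorised one simply hands the whole index array to NumPy in a single call.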
Input
- idx: np.ndarray of shape (B, T), integer token ids, each in [0, V).
- weight: np.ndarray of shape (V, C), the embedding table; row i is the vector for token id i.
Output
Returns an np.ndarray of shape (B, T, C) — the embedding vector for each token in idx.
Examples
Example 1 — identity table makes the lookup obvious
Input: weight = np.eye(5), idx = [[3, 0, 1]]
Output: [[[0, 0, 0, 1, 0],
[1, 0, 0, 0, 0],
[0, 1, 0, 0, 0]]] # shape (1, 3, 5)
Explanation: with the identity table, row i is the i-th standard basis vector. Token 3 picks row 3 → [0,0,0,1,0], token 0 picks row 0, token 1 picks row 1.
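Example 1 can be reproduced directly in NumPy. This is only a quick check that uses weight[idx] as the lookup; the variable names are chosen here for illustration.

```python
import numpy as np

weight = np.eye(5)              # identity table: row i is the i-th basis vector
idx = np.array([[3, 0, 1]])     # shape (1, 3)

out = weight[idx]               # shape (1, 3, 5)

expected = np.array([[[0, 0, 0, 1, 0],
                      [1, 0, 0, 0, 0],
                      [0, 1, 0, 0, 0]]], dtype=float)
assert out.shape == (1, 3, 5)
assert np.array_equal(out, expected)
```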
Example 2 — the same id always maps to the same vector
Input: weight: a (5, 16) table, idx = [[2, 0, 2, 1, 2]]
Output: shape (1, 5, 16); out[0,0] == out[0,2] == out[0,4] == weight[2]
Explanation: token id 2 appears at positions 0, 2, and 4; each looks up the same row weight[2], so all three vectors are identical. A correct lookup guarantees this for free.
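The repeated-id property in Example 2 can be checked with a short script. The random table and seed below are assumptions made for illustration. The last lines also show that integer-array indexing returns a copy, so editing the output does not alter the table.

```python
import numpy as np

rng = np.random.default_rng(0)
weight = rng.standard_normal((5, 16))   # a (5, 16) table with arbitrary values
idx = np.array([[2, 0, 2, 1, 2]])

out = weight[idx]
assert out.shape == (1, 5, 16)

# Token id 2 sits at positions 0, 2 and 4; all three rows equal weight[2] exactly.
for t in (0, 2, 4):
    assert np.array_equal(out[0, t], weight[2])

# Fancy indexing copies the rows: mutating `out` leaves `weight` unchanged.
before = weight.copy()
out[0, 0, 0] += 1.0
assert np.array_equal(weight, before)
```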
Constraints
- Indices are integers in [0, V); the output is float, of shape (B, T, C).
- Use vectorised fancy indexing (weight[idx]). A Python for loop over (B, T) adds interpreter overhead to what is a single C-level operation.
- A repeated token id must yield the identical row at every occurrence (atol=1e-12).
- Must scale to GPT-2 sizes (e.g. V = 50257, C = 768) without materialising any per-token intermediate of size V.
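The index constraint is worth checking explicitly when debugging, because NumPy does not enforce it the way one might expect: an id at or above the number of rows raises IndexError, but a negative id silently wraps around to the end of the table. The helper below, check_token_ids, is a hypothetical name introduced for this sketch and is not required by the problem.

```python
import numpy as np


def check_token_ids(idx: np.ndarray, vocab_size: int) -> None:
    """Hypothetical helper: raise unless idx holds integers with 0 <= id < vocab_size."""
    if not np.issubdtype(idx.dtype, np.integer):
        raise TypeError(f"token ids must be integers, got dtype {idx.dtype}")
    if idx.size and (idx.min() < 0 or idx.max() >= vocab_size):
        raise ValueError(f"token ids must lie in [0, {vocab_size})")


weight = np.arange(12, dtype=np.float64).reshape(4, 3)   # V=4, C=3 (illustrative)

# Valid ids pass silently.
check_token_ids(np.array([[0, 3]]), vocab_size=4)

# A negative id is accepted by NumPy and wraps to the last row.
assert np.array_equal(weight[np.array([[-1]])], weight[np.array([[3]])])

# The helper rejects it instead.
try:
    check_token_ids(np.array([[-1]]), vocab_size=4)
except ValueError as err:
    print("rejected:", err)
```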
Notes
- Equivalent but wasteful. A lookup equals a one-hot matmul, np.eye(V)[idx] @ weight, but that builds a (B, T, V) tensor: fine for a toy vocabulary, fatal at GPT-2's V = 50257. The lookup skips it entirely.
- Series. This is step 1 of the build-gpt track; later steps add positional encodings, attention, layer norm, and the full transformer block.
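The one-hot equivalence can be verified on a toy vocabulary, and the cost of the intermediate can be estimated with plain arithmetic. In this sketch the toy sizes and the batch shape B=8, T=1024 are assumptions chosen for illustration; only V = 50257 comes from GPT-2.

```python
import numpy as np

rng = np.random.default_rng(0)
V, C = 6, 4                                  # toy sizes
weight = rng.standard_normal((V, C))
idx = np.array([[5, 0, 2], [1, 1, 4]])       # (B, T) = (2, 3)

one_hot = np.eye(V)[idx]                     # (B, T, V): one 1.0 per token
via_matmul = one_hot @ weight                # (B, T, C)

assert one_hot.shape == (2, 3, V)
assert np.allclose(via_matmul, weight[idx])

# Size of the float64 one-hot intermediate at a larger, assumed batch shape.
B, T, V_big = 8, 1024, 50257
one_hot_bytes = B * T * V_big * 8            # 8 bytes per float64 element
print(f"one-hot intermediate: {one_hot_bytes / 2**30:.2f} GiB")
```

The direct lookup never allocates this intermediate; its only allocation is the (B, T, C) output.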
This problem ships 5 hidden tests. They run in your browser via Pyodide, with no backend and no submission queue.
- Output shape is (B, T, C)
- Result matches direct row indexing weight[idx]
- Same id at different positions returns identical vectors
- Single batch (B=1) works
- GPT-2-scale shapes work (vocab=50257, n_embd=64, B=2, T=8)