Bag-of-words encoding
Background
Bag-of-words is the most basic text vectoriser: build a vocabulary of all terms, then represent each document by the count of each vocabulary word, discarding order. It is the input to Naive Bayes, TF-IDF weighting, and classic text classifiers.
Problem statement
Implement bag_of_words(corpus), returning the count matrix M for a corpus of tokenised documents. The vocabulary V is the sorted set of all terms; entry M[i, j] is the number of times term j of the vocabulary appears in document i.
Input
corpus: list[list[str]], where each inner list is a document's tokens.
Output
Returns an np.ndarray of shape (N, |V|) of integer counts, where N is the number of documents and |V| is the vocabulary size, with columns in sorted-vocabulary order.
Examples
Example 1
Input: corpus = [["a", "b", "a"], ["b", "c"]]
Output: vocab = ["a", "b", "c"]; M = [[2, 1, 0], [0, 1, 1]]
Explanation: doc 0 has "a" twice and "b" once; doc 1 has "b" and "c" once each. "c" never appears in doc 0, so that entry is 0.
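One way to arrive at the matrix in Example 1 is sketched below. This is a minimal illustration rather than the reference solution: it assumes tokens are compared as exact strings (no lowercasing or stemming), and the local names vocab, index and counts are illustrative only.

```python
import numpy as np


def bag_of_words(corpus):
    """Return the (N, |V|) integer count matrix for a tokenised corpus."""
    # The sorted set of all terms fixes the column order.
    vocab = sorted({token for doc in corpus for token in doc})
    # Map each term to its column so lookups are O(1).
    index = {term: j for j, term in enumerate(vocab)}

    counts = np.zeros((len(corpus), len(vocab)), dtype=int)
    for i, doc in enumerate(corpus):
        for token in doc:
            counts[i, index[token]] += 1
    return counts


corpus = [["a", "b", "a"], ["b", "c"]]
M = bag_of_words(corpus)
print(M.tolist())  # [[2, 1, 0], [0, 1, 1]]
```

The dictionary lookup keeps the work proportional to the total number of tokens plus the cost of sorting the vocabulary, instead of scanning the vocabulary list once per token.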
Constraints
- Vocabulary = sorted set of all terms; columns follow that order.
- Entries are raw integer counts (not binary presence).
- Row i sums to the length of document i.
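The constraints above can be turned into a small self-check. The helper check_constraints below is hypothetical (it is not part of the problem's test harness); it takes a corpus and a candidate matrix and asserts each constraint in turn.

```python
import numpy as np


def check_constraints(corpus, M):
    """Assert that M satisfies the listed constraints for this corpus."""
    M = np.asarray(M)
    vocab = sorted({token for doc in corpus for token in doc})

    # One row per document, one column per vocabulary term.
    assert M.shape == (len(corpus), len(vocab))
    # Entries are integer counts.
    assert np.issubdtype(M.dtype, np.integer)

    for i, doc in enumerate(corpus):
        # Each row sums to the length of its document.
        assert M[i].sum() == len(doc)
        # Column j holds the raw count of the j-th sorted term.
        for j, term in enumerate(vocab):
            assert M[i, j] == doc.count(term)
    return True


corpus = [["a", "b", "a"], ["b", "c"]]
M = np.array([[2, 1, 0], [0, 1, 1]])
print(check_constraints(corpus, M))  # True
```

A binary presence matrix such as [[1, 1, 0], [0, 1, 1]] fails the row-sum assertion for document 0, whose length is 3.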
Notes
- Bag-of-words discards word order entirely — "dog bites man" and "man bites dog" map to identical vectors.
- It is the count matrix that TF-IDF re-weights; a binary variant (presence/absence) is the "set-of-words" model.
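Both notes can be seen on a toy corpus. The sketch below assumes whitespace tokenisation and adds a third, made-up document ("dog dog bites") so that the count and binary matrices differ; it builds the counts with collections.Counter rather than with any particular bag_of_words implementation.

```python
from collections import Counter

import numpy as np

docs = [
    "dog bites man".split(),
    "man bites dog".split(),
    "dog dog bites".split(),  # extra illustrative document
]
vocab = sorted({token for doc in docs for token in doc})  # ['bites', 'dog', 'man']

# Counter returns 0 for terms that are absent from a document.
counts = np.array([[Counter(doc)[term] for term in vocab] for doc in docs])
# Set-of-words: keep only presence or absence.
binary = (counts > 0).astype(int)

print(counts.tolist())  # [[1, 1, 1], [1, 1, 1], [1, 2, 0]]
print(binary.tolist())  # [[1, 1, 1], [1, 1, 1], [1, 1, 0]]
```

The first two rows are identical in both matrices because only word order differs, while the repeated "dog" in the third document is visible only in the count matrix.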
This problem ships 4 hidden tests, which run in the browser via Pyodide:
- Reference example
- Shape is (n_docs, vocab_size)
- Each row sums to its document length
- Counts, not binary presence