TF-IDF
Background
TF-IDF (term frequency-inverse document frequency) turns text into numeric features by weighting each term by how often it appears in a document (TF) against how rare it is across the corpus (IDF). Common words like "the" appear everywhere and get near-zero weight; distinctive words that pinpoint a document get high weight. It is the classic baseline representation for search, clustering, and text classification.
Problem statement
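Implement tf_idf(corpus) that returns the TF-IDF matrix for a corpus of tokenised documents. With $N$ documents, vocabulary $V$ (the sorted set of all terms), term frequency $\mathrm{tf}(t, d)$, and document frequency $\mathrm{df}(t)$:

$$\mathrm{tfidf}(t, d) = \mathrm{tf}(t, d) \cdot \ln\frac{N}{\mathrm{df}(t)}$$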
Input
corpus (list[list[str]]): each inner list is a document's tokens.
Output
Returns an np.ndarray M of shape (N, |V|), where columns follow the sorted vocabulary order and M[d, t] is term $t$'s TF-IDF in document $d$.
Examples
Example 1
Input: corpus = [["a", "b"], ["a", "c"]]
Output: vocab = ["a", "b", "c"];
M = [[0.0, 0.3466, 0.0], [0.0, 0.0, 0.3466]]
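Explanation: "a" appears in both documents, so $\mathrm{idf}(a) = \ln(2/2) = 0$ and its column is all zeros. "b" appears in only doc 0: $\mathrm{tf} = 1/2$, $\mathrm{idf} = \ln(2/1) \approx 0.6931$, giving $0.5 \times 0.6931 \approx 0.3466$.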
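The arithmetic in this example can be checked with a few lines of Python. This is only a sketch: the helper names tf and idf are illustrative, and it assumes term frequency is count over document length and IDF is the natural log of the number of documents over the document frequency.

```python
import math

corpus = [["a", "b"], ["a", "c"]]
n_docs = len(corpus)

def tf(term, doc):
    # Term frequency: occurrences of the term divided by document length.
    return doc.count(term) / len(doc)

def idf(term):
    # Unsmoothed IDF: ln(number of documents / documents containing the term).
    df = sum(term in doc for doc in corpus)
    return math.log(n_docs / df)

print(idf("a"))                                   # 0.0
print(round(tf("b", corpus[0]) * idf("b"), 4))    # 0.3466
```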
Constraints
- Vocabulary is the sorted set of all terms; columns follow that order.
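- $\mathrm{tf}(t, d)$ = count of $t$ in document $d$ / document length; $\mathrm{idf}(t) = \ln(N / \mathrm{df}(t))$ (natural log).
- A term present in every document has $\mathrm{idf}(t) = 0$.
- Tests compare with an absolute tolerance of atol=1e-6.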
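One way the constraints above might be turned into code is sketched below. It is a minimal NumPy version, not a reference solution: the handling of empty documents (an all-zero row) is an assumption, since the problem statement does not cover that case.

```python
import numpy as np

def tf_idf(corpus):
    """Unsmoothed TF-IDF matrix; columns follow the sorted vocabulary."""
    n_docs = len(corpus)
    vocab = sorted({term for doc in corpus for term in doc})
    col = {term: j for j, term in enumerate(vocab)}

    # Document frequency: number of documents containing each term.
    df = np.zeros(len(vocab))
    for doc in corpus:
        for term in set(doc):
            df[col[term]] += 1

    M = np.zeros((n_docs, len(vocab)))
    for d, doc in enumerate(corpus):
        if not doc:
            continue  # assumption: an empty document yields an all-zero row
        for term in doc:
            M[d, col[term]] += 1  # raw count
        M[d] /= len(doc)          # count -> term frequency

    # Every vocabulary term occurs in at least one document, so df > 0.
    idf = np.log(n_docs / df)
    return M * idf  # broadcasts idf across rows
```

Calling tf_idf([["a", "b"], ["a", "c"]]) on the example corpus gives a (2, 3) matrix whose "a" column is all zeros.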
Notes
- IDF is what down-weights ubiquitous words: $\mathrm{idf}(t)$ shrinks to 0 as a term approaches appearing in all documents.
- Real implementations often smooth IDF and L2-normalise rows; this is the unsmoothed textbook form.
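For comparison with the unsmoothed form, here is a sketch of one common smoothing variant plus row normalisation. The formula ln((1 + N) / (1 + df)) + 1 is the one scikit-learn documents as its default smoothed IDF. The function names smooth_idf and l2_normalise_rows are hypothetical helpers for illustration, not part of this problem or of any library.

```python
import numpy as np

def smooth_idf(df, n_docs):
    # Smoothed IDF, as documented for scikit-learn's default:
    # ln((1 + N) / (1 + df)) + 1, which never reaches zero.
    df = np.asarray(df, dtype=float)
    return np.log((1 + n_docs) / (1 + df)) + 1

def l2_normalise_rows(M):
    # Scale each row to unit Euclidean length; all-zero rows stay zero.
    M = np.asarray(M, dtype=float)
    norms = np.linalg.norm(M, axis=1, keepdims=True)
    return np.divide(M, norms, out=np.zeros_like(M), where=norms > 0)
```

With this smoothing, a term that appears in every document gets an IDF of 1 instead of 0, so its column no longer vanishes.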
This problem ships 4 hidden tests, which run in your browser via Pyodide:
- A term appearing in every document gets zero TF-IDF
- Matrix shape is (n_docs, vocab_size)
- Reference example values
- Rarer terms score higher than common ones at equal term frequency