TF-IDFMedium

TF-IDF

Background

TF-IDF (term frequency-inverse document frequency) turns text into numeric features by weighting each term by how often it appears in a document (TF) against how rare it is across the corpus (IDF). Common words like "the" appear everywhere and get near-zero weight; distinctive words that pinpoint a document get high weight. It is the classic baseline representation for search, clustering, and text classification.

Problem statement

Implement tf_idf(corpus) that returns the TF-IDF matrix for a corpus of tokenised documents. With NN documents, vocabulary VV (the sorted set of all terms), term frequency, and document frequency df(t)\text{df}(t):

tf(t,d)=count(t,d)d,idf(t)=lnNdf(t)\text{tf}(t, d) = \frac{\text{count}(t, d)}{|d|}, \qquad \text{idf}(t) = \ln\frac{N}{\text{df}(t)} Md,t=tf(t,d)idf(t)M_{d,t} = \text{tf}(t, d)\cdot \text{idf}(t)

Input

  • corpuslist[list[str]]: each inner list is a document's tokens.

Output

Returns an np.ndarray of shape (N, |V|), where columns follow the sorted vocabulary order and M[d, t] is term tt's TF-IDF in document dd.

Examples

Example 1

Input:  corpus = [["a", "b"], ["a", "c"]]
Output: vocab = ["a", "b", "c"];
        M = [[0.0, 0.3466, 0.0], [0.0, 0.0, 0.3466]]

Explanation: "a" appears in both documents, so idf("a")=ln(2/2)=0\text{idf}("a") = \ln(2/2) = 0 and its column is all zeros. "b" appears in only doc 0: tf=0.5\text{tf} = 0.5, idf=ln2\text{idf} = \ln 2, giving 0.5ln20.34660.5\ln 2 \approx 0.3466.

Constraints

  • Vocabulary is the sorted set of all terms; columns follow that order.
  • tf=\text{tf} = count in document / document length; idf=ln(N/df(t))\text{idf} = \ln(N / \text{df}(t)) (natural log).
  • A term present in every document has idf=0\text{idf} = 0.
  • Tests compare with atol=1e-6.

Notes

  • IDF is what down-weights ubiquitous words: ln(N/df)\ln(N/\text{df}) shrinks to 0 as a term approaches appearing in all documents.
  • Real implementations often smooth IDF (e.g. ln(N/(1+df))+1\ln(N/(1+\text{df})) + 1) and L2-normalise rows; this is the unsmoothed textbook form.
Python
Loading...

This problem ships 4 hidden tests. They run in your browser via Pyodide — no backend, no submission queue. Press ▶ Run tests to execute.

  • A term appearing in every document gets zero tf-idf
  • Matrix shape is (n_docs, vocab_size)
  • Reference example values
  • Rarer terms score higher than common ones at equal term frequency