Bag-of-words encoding

Difficulty: Easy

Background

Bag-of-words is the most basic text vectoriser: build a vocabulary of all terms, then represent each document by the count of each vocabulary word, discarding order. It is the input to Naive Bayes, TF-IDF weighting, and classic text classifiers.
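As a small illustration of the idea, the sketch below counts the terms of a single document against a fixed vocabulary using only the standard library. The toy document and vocabulary are made up for this example and are not part of the problem.

```python
from collections import Counter

# Hypothetical toy document and vocabulary, chosen only for illustration.
doc = ["the", "cat", "saw", "the", "dog"]
vocab = ["cat", "dog", "saw", "the"]  # already in sorted order

counts = Counter(doc)                # term -> number of occurrences
vector = [counts[t] for t in vocab]  # Counter returns 0 for absent terms
print(vector)  # [1, 1, 1, 2]
```

The position of each token in the document plays no part in the result; only how often it occurs matters.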

Problem statement

Implement bag_of_words(corpus) returning the count matrix for a corpus of tokenised documents. The vocabulary is the sorted set of all terms; entry (d, t) is the number of times term t appears in document d.

Input

  • corpus: list[list[str]]. Each inner list is a document's tokens.

Output

Returns an np.ndarray of shape (N, |V|) of integer counts, where N is the number of documents and |V| is the vocabulary size, with columns in sorted-vocabulary order.
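One possible way to meet this specification is sketched below. It is not an official reference solution. It assumes NumPy is available and that "sorted" means Python's default string ordering (by code point, so uppercase sorts before lowercase). The names vocab, index and M are local choices.

```python
import numpy as np

def bag_of_words(corpus):
    """Return an (N, |V|) integer count matrix, columns in sorted-vocabulary order."""
    # Vocabulary: sorted set of every token seen in any document.
    vocab = sorted({tok for doc in corpus for tok in doc})
    # Map each term to its column index.
    index = {term: j for j, term in enumerate(vocab)}

    M = np.zeros((len(corpus), len(vocab)), dtype=int)
    for d, doc in enumerate(corpus):
        for tok in doc:
            M[d, index[tok]] += 1  # one increment per occurrence
    return M

M = bag_of_words([["a", "b", "a"], ["b", "c"]])
print(M.tolist())  # [[2, 1, 0], [0, 1, 1]]
```

The dictionary lookup keeps the work proportional to the total number of tokens, rather than scanning the vocabulary once per token.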

Examples

Example 1

Input:  corpus = [["a", "b", "a"], ["b", "c"]]
Output: vocab = ["a", "b", "c"]; M = [[2, 1, 0], [0, 1, 1]]

Explanation: doc 0 has "a" twice and "b" once; doc 1 has "b" and "c" once each. "c" never appears in doc 0, so that entry is 0.

Constraints

  • Vocabulary = sorted set of all terms; columns follow that order.
  • Entries are raw integer counts (not binary presence).
  • Row d sums to the length of document d.
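These constraints can be turned into quick sanity checks. The sketch below builds the matrix as plain nested lists with list.count, a deliberately simple (and slower) construction used here only to have something to check, and then asserts each constraint on the example corpus.

```python
corpus = [["a", "b", "a"], ["b", "c"]]

# Simple list-based construction, used only as a checking aid.
vocab = sorted({tok for doc in corpus for tok in doc})
M = [[doc.count(term) for term in vocab] for doc in corpus]

# Columns follow the sorted vocabulary, with no duplicate terms.
assert vocab == sorted(set(vocab))

# Entries are raw counts: "a" appears twice in document 0.
assert M[0][vocab.index("a")] == 2

# Each row sums to the length of its document.
assert all(sum(row) == len(doc) for row, doc in zip(M, corpus))
```

The row-sum check works because every token in a document contributes exactly one increment to exactly one column.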

Notes

  • Bag-of-words discards word order entirely — "dog bites man" and "man bites dog" map to identical vectors.
  • It is the count matrix that TF-IDF re-weights; a binary variant (presence/absence) is the "set-of-words" model.
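Both notes can be seen in a few lines. In the sketch below, count_vector is a hypothetical helper written for this illustration, and the vocabulary is fixed by hand.

```python
from collections import Counter

def count_vector(tokens, vocab):
    """Hypothetical helper: counts of each vocabulary term in one document."""
    c = Counter(tokens)
    return [c[t] for t in vocab]

vocab = ["bites", "dog", "man"]

# Word order is discarded: both sentences give the same vector.
a = count_vector("dog bites man".split(), vocab)
b = count_vector("man bites dog".split(), vocab)
print(a == b)  # True

# Set-of-words variant: clip each count to presence/absence.
repeated = count_vector(["dog", "dog", "man"], vocab)
binary = [int(c > 0) for c in repeated]
print(repeated, binary)  # [0, 2, 1] [0, 1, 1]
```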

This problem ships 4 hidden tests, which run in the browser via Pyodide:

  • Reference example
  • Shape is (n_docs, vocab_size)
  • Each row sums to its document length
  • Counts, not binary presence