Unigram probability from a corpus

Difficulty: Easy

Background

The unigram model is the simplest language model: it assigns each word its relative frequency in a corpus, $P(w) = \text{count}(w)/N$, where $N$ is the total number of tokens. This is the maximum-likelihood estimate under a bag-of-words (unigram) model, and it is the starting point before adding context (bigrams, n-grams) or smoothing.

Problem statement

Implement unigram_probability(corpus, word) returning the relative frequency of word:

$$P(w) = \frac{\text{count}(w)}{N}$$

where $N$ is the total number of whitespace-separated tokens. Round to 4 decimals.

Input

  • corpus (str): text, tokenised by whitespace.
  • word (str): the target token.

Output

Returns a float in $[0, 1]$ (rounded to 4 decimals).

Examples

Example 1

Input:  corpus = "<s> Jack I like </s> <s> Jack I do like </s>", word = "Jack"
Output: 0.1818

Explanation: the corpus has 11 tokens and "Jack" appears twice, so $P(\text{Jack}) = 2/11 \approx 0.1818$.
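
The counts in this example can be checked with a short sketch. `collections.Counter` is used here only for illustration; the problem does not require it, and the variable names are assumptions.

```python
from collections import Counter

corpus = "<s> Jack I like </s> <s> Jack I do like </s>"
tokens = corpus.split()            # whitespace tokenisation; "<s>" and "</s>" count as tokens
counts = Counter(tokens)

print(len(tokens))                                # 11 (total tokens, N)
print(counts["Jack"])                             # 2  (occurrences of the target word)
print(round(counts["Jack"] / len(tokens), 4))     # 0.1818
```

Note that the sentence markers `<s>` and `</s>` are ordinary tokens under whitespace splitting, which is why N is 11 rather than 7.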

Constraints

  • Tokenise by whitespace (split()) and count exact-match tokens.
  • $P(w) = \text{count}(w)/N$, rounded to 4 decimals.
  • A word absent from the corpus has probability 0.
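
The constraints above can be sketched as follows. This is one possible implementation, not necessarily the reference solution; in particular, returning 0.0 for an empty corpus is an assumption, since the problem statement does not cover that case.

```python
def unigram_probability(corpus: str, word: str) -> float:
    """Relative frequency of `word` among the whitespace-separated tokens of `corpus`."""
    tokens = corpus.split()
    if not tokens:
        # Assumption: an empty corpus yields 0.0 (not specified by the problem).
        return 0.0
    # list.count performs exact-match comparison, so "jack" != "Jack".
    return round(tokens.count(word) / len(tokens), 4)


corpus = "<s> Jack I like </s> <s> Jack I do like </s>"
print(unigram_probability(corpus, "Jack"))   # 0.1818
print(unigram_probability(corpus, "Mary"))   # 0.0 (absent word)
print(unigram_probability("a a a", "a"))     # 1.0 (single repeated word)
```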

Notes

  • The MLE gives probability 0 to unseen words, which is the motivation for smoothing (for example add-one/Laplace) in real language models.
  • Summed over the whole vocabulary, the unigram probabilities form a valid distribution (they total 1).
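
Both notes can be illustrated with a small sketch. The helper names `unigram_distribution` and `add_one_probability` and the `vocab_size` parameter are hypothetical and not part of the problem; the add-one formula used is the standard $(\text{count}(w) + 1)/(N + V)$, where $V$ is an assumed vocabulary size.

```python
from collections import Counter


def unigram_distribution(corpus: str) -> dict:
    """Unrounded MLE unigram probability for every word type in the corpus."""
    tokens = corpus.split()
    counts = Counter(tokens)
    return {w: c / len(tokens) for w, c in counts.items()}


def add_one_probability(corpus: str, word: str, vocab_size: int) -> float:
    """Add-one (Laplace) estimate: (count(w) + 1) / (N + V)."""
    tokens = corpus.split()
    return (tokens.count(word) + 1) / (len(tokens) + vocab_size)


corpus = "<s> Jack I like </s> <s> Jack I do like </s>"
dist = unigram_distribution(corpus)

# The MLE probabilities total 1, up to floating-point error.
print(abs(sum(dist.values()) - 1.0) < 1e-9)       # True

# Assumed vocabulary: the 6 observed word types plus 1 slot for the unseen word.
print(add_one_probability(corpus, "Mary", 7) > 0)  # True: unseen words get non-zero mass
```

Summing the rounded values returned by the exercise's function may differ slightly from 1, which is presumably why the hidden test checks for approximately 1.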

This problem ships 4 hidden tests, which run in your browser via Pyodide (no backend, no submission queue):

  • Reference example: 2/11
  • Absent word has probability 0
  • Probabilities over the vocabulary sum to ~1
  • A single repeated word has probability 1