Unigram probability from a corpus
Difficulty: Easy. Tags: nlp, language-models, probability, fundamentals
Background
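The unigram probability of a word $w$ is the simplest language model: its relative frequency in a corpus, $P(w) = \frac{\text{count}(w)}{N}$, where $\text{count}(w)$ is the number of occurrences of $w$ and $N$ is the total number of tokens. This is the maximum-likelihood estimate under a bag-of-words (unigram) model, and it is the starting point before adding context (bigrams, n-grams) or smoothing.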
Problem statement
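Implement unigram_probability(corpus, word) returning the relative frequency of word:
$$P(w) = \frac{\text{count}(w)}{N}$$
where $w$ is word, $\text{count}(w)$ is the number of tokens equal to word, and $N$ is the total number of whitespace-separated tokens. Round to 4 decimals.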
Input
- corpus (str): text, tokenised by whitespace.
- word (str): the target token.
Output
Returns a float in $[0, 1]$ (rounded to 4 decimals).
Examples
Example 1
Input: corpus = "<s> Jack I like </s> <s> Jack I do like </s>", word = "Jack"
Output: 0.1818
Explanation: the corpus has 11 tokens (the sentence markers <s> and </s> count as tokens) and "Jack" appears twice, so $P(\text{Jack}) = 2/11 \approx 0.1818$.
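The arithmetic in Example 1 can be checked by hand with a short sketch. It uses collections.Counter from the standard library to tally tokens; the variable names (tokens, counts, total, jack_count, jack_prob) are illustrative and not part of the problem.

```python
from collections import Counter

corpus = "<s> Jack I like </s> <s> Jack I do like </s>"

# Whitespace tokenisation: sentence markers are ordinary tokens here.
tokens = corpus.split()

# Exact-match counts for every distinct token.
counts = Counter(tokens)

total = len(tokens)              # N, the total number of tokens
jack_count = counts["Jack"]      # count(w) for w = "Jack"
jack_prob = round(jack_count / total, 4)

print(total, jack_count, jack_prob)
```

Counting <s> and </s> is what brings the total to 11; dropping them would change the denominator and therefore the answer.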
Constraints
- Tokenise by whitespace (split()) and count exact-match tokens.
- $P(w) = \text{count}(w) / N$, rounded to 4 decimals.
- A word absent from the corpus has probability 0.
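One possible way to satisfy these constraints is sketched below. It is a minimal version, not necessarily the reference solution. The handling of an empty corpus (returning 0.0 to avoid division by zero) is an assumption, since the problem does not specify that case.

```python
def unigram_probability(corpus: str, word: str) -> float:
    """Relative frequency of `word` among whitespace-separated tokens."""
    tokens = corpus.split()  # split() with no argument splits on any whitespace

    # Assumption: an empty corpus yields 0.0 (not specified by the problem).
    if not tokens:
        return 0.0

    # list.count performs exact-match comparison, so "jack" != "Jack".
    return round(tokens.count(word) / len(tokens), 4)


example = "<s> Jack I like </s> <s> Jack I do like </s>"
print(unigram_probability(example, "Jack"))
print(unigram_probability(example, "Sam"))
```

Matching is case-sensitive and punctuation-sensitive because tokens are compared as-is; any normalisation would have to happen before the call.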
Notes
- This MLE estimate gives probability 0 to unseen words — the motivation for smoothing (add-one/Laplace) in real language models.
- Summed over the whole vocabulary, the unigram probabilities form a valid distribution (they total 1).
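Both notes can be illustrated with a small sketch. The helper names unigram_distribution and add_one_probability are hypothetical, and the add-one variant assumes the caller supplies a vocabulary size that already includes any unseen word types. Values are left unrounded here, since rounded probabilities may not total exactly 1.

```python
from collections import Counter


def unigram_distribution(corpus: str) -> dict[str, float]:
    """Unrounded MLE probability for every distinct token in the corpus."""
    tokens = corpus.split()
    counts = Counter(tokens)
    total = len(tokens)
    return {w: c / total for w, c in counts.items()}


def add_one_probability(corpus: str, word: str, vocab_size: int) -> float:
    """Add-one (Laplace) estimate: (count(w) + 1) / (N + V).

    Assumption: vocab_size (V) counts every word type the model should
    cover, including types that never occur in the corpus.
    """
    tokens = corpus.split()
    return (tokens.count(word) + 1) / (len(tokens) + vocab_size)


corpus = "<s> Jack I like </s> <s> Jack I do like </s>"
dist = unigram_distribution(corpus)
print(sum(dist.values()))  # close to 1, up to floating-point error

# 6 observed types plus one hypothetical unseen type, "Sam".
vocab_size = len(dist) + 1
print(add_one_probability(corpus, "Sam", vocab_size))  # non-zero for an unseen word
```

With add-one smoothing the unseen word receives 1 / (11 + 7), and the seven smoothed probabilities still total 1.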
This problem ships 4 hidden tests, which run in the browser via Pyodide:
- Reference example: 2/11
- Absent word has probability 0
- Probabilities over the vocabulary sum to ~1
- A single repeated word has probability 1