ROUGE-1 score
Background
ROUGE (Recall-Oriented Understudy for Gisting Evaluation) is the standard recall-oriented metric for summarisation: it measures how much of a reference summary's content appears in a generated one. ROUGE-1 counts overlapping unigrams and reports precision, recall, and their F1. It complements BLEU, which is a precision-oriented metric for translation.
Problem statement
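Implement rouge_1_score(reference, candidate) returning a dict with precision, recall, and F1 for unigram overlap. Lowercase and whitespace-tokenise both texts, count the clipped unigram overlap, then:
$$P = \frac{\text{overlap}}{|\text{candidate}|}, \qquad R = \frac{\text{overlap}}{|\text{reference}|}, \qquad F_1 = \frac{2PR}{P + R}$$
where $\text{overlap} = \sum_{w} \min\big(\text{count}_{\text{ref}}(w), \text{count}_{\text{cand}}(w)\big)$ and $|\cdot|$ is the number of tokens in a text.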
Input
- reference (str): the reference text.
- candidate (str): the generated text.
Output
Returns a dict with float values under keys "precision", "recall", "f1" (each in $[0, 1]$).
Examples
Example 1
Input: reference = "the cat sat on the mat", candidate = "the cat is on the mat"
Output: {"precision": 0.8333, "recall": 0.8333, "f1": 0.8333}
Explanation: clipped overlap is the(2) + cat(1) + on(1) + mat(1) = 5; both texts have 6 tokens, so $P = R = 5/6 \approx 0.8333$ and $F_1 = 5/6 \approx 0.8333$.
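The clipped overlap in this example can be checked by hand with a short sketch. It assumes Python's collections.Counter, whose & operator keeps each shared word with the minimum of its two counts; the variable names are illustrative only.

```python
from collections import Counter

reference = "the cat sat on the mat"
candidate = "the cat is on the mat"

# Lowercase and split on whitespace, then count each word.
ref_counts = Counter(reference.lower().split())
cand_counts = Counter(candidate.lower().split())

# Multiset intersection: each shared word keeps min(ref count, cand count).
clipped = ref_counts & cand_counts
overlap = sum(clipped.values())

print(clipped["the"], clipped["cat"], clipped["on"], clipped["mat"])  # 2 1 1 1
print(overlap)  # 5
```

"sat" and "is" each appear in only one text, so they contribute nothing to the overlap.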
Constraints
- Lowercase and split on whitespace.
- Overlap is clipped: each word contributes the minimum of its counts in the two texts.
- Guard divide-by-zero: any metric whose denominator is 0 returns 0.0.
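Taken together, the constraints suggest an implementation along the following lines. This is a minimal sketch of one possible solution, not a reference answer: it assumes unrounded floats are acceptable (the 0.8333 in the example is taken to be display rounding) and uses collections.Counter for the clipped counts.

```python
from collections import Counter


def rouge_1_score(reference: str, candidate: str) -> dict:
    """Return ROUGE-1 precision, recall and F1 from clipped unigram overlap."""
    ref_tokens = reference.lower().split()
    cand_tokens = candidate.lower().split()

    # Clipped overlap: each word contributes min(count in ref, count in cand).
    overlap = sum((Counter(ref_tokens) & Counter(cand_tokens)).values())

    # Each metric falls back to 0.0 when its denominator is 0.
    precision = overlap / len(cand_tokens) if cand_tokens else 0.0
    recall = overlap / len(ref_tokens) if ref_tokens else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total > 0 else 0.0

    return {"precision": precision, "recall": recall, "f1": f1}
```

With texts of different lengths the two sides diverge: for reference "a b c d" and candidate "a b", every candidate token is matched (precision 1.0) but only half of the reference is covered (recall 0.5).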
Notes
- ROUGE-1 precision is essentially BLEU's unigram precision; ROUGE adds the recall side (did we cover the reference?), which is what matters for summarisation.
- ROUGE-L (longest common subsequence) and ROUGE-2 (bigrams) extend this; ROUGE-1 is the simplest, most common baseline.
This problem ships 4 hidden tests:
- Reference example -> all 5/6
- Identical texts -> all 1.0
- No overlap -> all zeros
- Different lengths -> precision and recall differ