METEOR scoreMedium

METEOR score

Background

METEOR scores machine translation by aligning unigrams between candidate and reference, then combining a recall-weighted F-mean with a fragmentation penalty that punishes scrambled word order. Unlike BLEU it rewards recall and ordering explicitly, correlating better with human judgement at the sentence level.

Problem statement

Implement meteor_score(reference, candidate, alpha=0.9, beta=3, gamma=0.5). Lowercase and tokenise both, count matched unigrams mm (clipped), then:

P=mcand,R=mref,Fmean=PRαP+(1α)RP = \frac{m}{|\text{cand}|}, \quad R = \frac{m}{|\text{ref}|}, \quad F_{\text{mean}} = \frac{P\,R}{\alpha P + (1-\alpha)R} Pen=γ(chunksm)β,METEOR=Fmean(1Pen)\text{Pen} = \gamma\Big(\frac{\text{chunks}}{m}\Big)^{\beta}, \qquad \text{METEOR} = F_{\text{mean}}\,(1 - \text{Pen})

where chunks is the number of contiguous runs of matched words in the candidate. Round to 3 decimals.

Input

  • referencestr.
  • candidatestr.
  • alpha, beta, gammafloat: METEOR parameters (defaults 0.9, 3, 0.5).

Output

Returns a float in [0,1][0, 1] (rounded to 3 decimals).

Examples

Example 1

Input:  meteor_score("Rain falls gently from the sky", "Gentle rain drops from the sky")
Output: 0.625

Explanation: 4 unigrams match (rain, from, the, sky); P=R=4/6P=R=4/6 gives Fmean=0.667F_{\text{mean}}=0.667. The matches form 2 chunks in the candidate, so Pen=0.5(2/4)3=0.0625\text{Pen}=0.5(2/4)^3=0.0625 and METEOR =0.667×0.9375=0.625=0.667\times0.9375=0.625.

Constraints

  • Lowercase + whitespace tokenise; match unigrams with clipping (each reference word usable once).
  • Fmean=PR/(αP+(1α)R)F_{\text{mean}} = PR/(\alpha P + (1-\alpha)R); chunks = contiguous runs of matched candidate positions.
  • Penalty =γ(chunks/m)β= \gamma(\text{chunks}/m)^\beta; final score =Fmean(1Pen)= F_{\text{mean}}(1-\text{Pen}), rounded to 3 dp.
  • Return 0 if there are no matches or either text is empty.

Notes

  • The chunk penalty is METEOR's signature: the same matched words score lower when scattered (many chunks) than when contiguous (few chunks).
  • The default α=0.9\alpha=0.9 weights recall heavily — in translation, covering the reference matters more than terseness.
Python
Loading...

This problem ships 4 hidden tests. They run in your browser via Pyodide — no backend, no submission queue. Press ▶ Run tests to execute.

  • Reference example: 0.625
  • Longer exact match scores near 1
  • No overlapping words -> 0
  • Fragmented matches score lower than contiguous (same number of matches)