Cohen's kappa score
Background
Cohen's kappa ($\kappa$) measures agreement between two raters (or between a prediction and ground truth), corrected for the agreement expected by chance. Two raters labelling at random still agree sometimes, so raw accuracy overstates real agreement; $\kappa$ subtracts off that chance baseline. $\kappa = 1$ is perfect agreement, $\kappa = 0$ is chance-level, and negative $\kappa$ means worse than chance. It is a more honest metric than raw accuracy when classes are imbalanced or when comparing two annotators.
Problem statement
Implement cohens_kappa(y1, y2) for two label sequences:
$$\kappa = \frac{p_o - p_e}{1 - p_e}$$
where $p_o$ is the observed agreement (fraction of matching labels) and $p_e$ is the chance agreement $p_e = \sum_k p_{1,k}\, p_{2,k}$, with $p_{i,k}$ the fraction of rater $i$'s labels equal to class $k$.
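The definition above can be sketched in plain Python as follows. This is one possible implementation and not necessarily the reference solution: the variable names `p_o` and `p_e` (observed and chance agreement) are chosen here, and the handling of empty input and of the degenerate case where chance agreement equals 1 is an assumption.

```python
from collections import Counter


def cohens_kappa(y1, y2):
    """Cohen's kappa for two equal-length label sequences (illustrative sketch)."""
    y1, y2 = list(y1), list(y2)
    if len(y1) != len(y2):
        raise ValueError("y1 and y2 must have the same length")
    n = len(y1)
    if n == 0:
        # Assumption: the problem statement does not define kappa for empty input.
        raise ValueError("inputs must be non-empty")

    # Observed agreement: fraction of positions where the labels match.
    p_o = sum(a == b for a, b in zip(y1, y2)) / n

    # Chance agreement: sum over classes of the product of each rater's
    # marginal frequency for that class. Counter returns 0 for absent keys.
    c1, c2 = Counter(y1), Counter(y2)
    p_e = sum((c1[k] / n) * (c2[k] / n) for k in set(c1) | set(c2))

    # Assumption: when both raters are constant on the same class the formula
    # is 0/0; this sketch treats that as perfect agreement.
    if p_e == 1.0:
        return 1.0
    return (p_o - p_e) / (1.0 - p_e)
```

Labels only need to be hashable and comparable with `==`, so strings and integers both work.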
Input
- y1: array-like of labels from rater 1.
- y2: array-like of labels from rater 2, the same length as y1.
Output
Returns a float (typically in $[-1, 1]$; $1$ = perfect agreement).
Examples
Example 1
Input: y1 = [1, 0, 1, 1, 0], y2 = [1, 0, 0, 1, 0]
Output: 0.6154
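Explanation: observed agreement $p_o = 4/5 = 0.8$. Chance agreement $p_e = 0.6 \cdot 0.4 + 0.4 \cdot 0.6 = 0.48$, so $\kappa = (0.8 - 0.48)/(1 - 0.48) \approx 0.6154$.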
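To check the arithmetic of Example 1 without floating-point rounding, the intermediate quantities can be recomputed with exact fractions. This is only a verification sketch; the names `p_o`, `p_e` and `kappa` are chosen here for illustration.

```python
from collections import Counter
from fractions import Fraction

y1 = [1, 0, 1, 1, 0]
y2 = [1, 0, 0, 1, 0]
n = len(y1)

# Observed agreement: 4 of the 5 positions match.
p_o = Fraction(sum(a == b for a, b in zip(y1, y2)), n)

# Chance agreement from each rater's class frequencies.
c1, c2 = Counter(y1), Counter(y2)
p_e = sum(Fraction(c1[k], n) * Fraction(c2[k], n) for k in set(y1) | set(y2))

kappa = (p_o - p_e) / (1 - p_e)
print(p_o, p_e, kappa, round(float(kappa), 4))  # 4/5 12/25 8/13 0.6154
```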
Constraints
- $p_o$ is the fraction of positions where the two labels match.
- $p_e = \sum_k p_{1,k}\, p_{2,k}$, summed over all classes $k$ that appear in either sequence.
- If $p_e = 1$ (both raters constant on the same class), return $1.0$ to guard against the $0/0$ division.
Notes
- $\kappa$ can be negative when raters agree less often than chance would predict.
- The chance correction is what makes $\kappa$ informative on imbalanced data, where plain accuracy looks high simply because one class dominates.
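Both notes can be illustrated on small made-up data. The sketch below uses a hypothetical helper, `accuracy_and_kappa`, that returns raw agreement alongside kappa; the data and the choice to return 1.0 in the degenerate 0/0 case are assumptions for illustration.

```python
from collections import Counter


def accuracy_and_kappa(y1, y2):
    """Return (raw agreement, Cohen's kappa) for two label sequences."""
    n = len(y1)
    p_o = sum(a == b for a, b in zip(y1, y2)) / n
    c1, c2 = Counter(y1), Counter(y2)
    p_e = sum(c1[k] * c2[k] for k in set(c1) | set(c2)) / (n * n)
    # Assumption: treat the degenerate 0/0 case as perfect agreement.
    kappa = 1.0 if p_e == 1.0 else (p_o - p_e) / (1.0 - p_e)
    return p_o, kappa


# Imbalanced data: 95% of labels are class 0, and the "predictor" always says 0.
truth = [0] * 95 + [1] * 5
always_zero = [0] * 100
acc, kappa = accuracy_and_kappa(truth, always_zero)
print(acc, kappa)  # 0.95 0.0

# Systematic disagreement: the raters never match, so kappa is negative.
_, negative_kappa = accuracy_and_kappa([0, 1, 0, 1], [1, 0, 1, 0])
print(negative_kappa)  # -1.0
```

The constant predictor scores 95% accuracy yet earns a kappa of 0, because all of its agreement is what chance alone would produce given the class frequencies.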
This problem ships 4 hidden tests, which run in your browser via Pyodide:
- Reference example: kappa = 0.6154
- Perfect agreement -> 1.0
- Symmetric in the two raters
- Chance-level agreement gives kappa = 0
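The hidden tests themselves are not shown, so the following is only a guess at what checks of this kind might look like. The inputs, the tolerance, and the compact `cohens_kappa` used here are all assumptions for illustration.

```python
from collections import Counter


def cohens_kappa(y1, y2):
    """Compact illustrative implementation (not the official reference)."""
    n = len(y1)
    p_o = sum(a == b for a, b in zip(y1, y2)) / n
    c1, c2 = Counter(y1), Counter(y2)
    p_e = sum(c1[k] * c2[k] for k in set(c1) | set(c2)) / (n * n)
    # Assumption: the degenerate 0/0 case is treated as perfect agreement.
    return 1.0 if p_e == 1.0 else (p_o - p_e) / (1.0 - p_e)


a = [1, 0, 1, 1, 0]
b = [1, 0, 0, 1, 0]

checks = {
    "reference example": abs(cohens_kappa(a, b) - 0.6154) < 1e-4,
    "perfect agreement": cohens_kappa(a, a) == 1.0,
    "symmetric": cohens_kappa(a, b) == cohens_kappa(b, a),
    # Half the labels match, which is what chance alone predicts here.
    "chance level": cohens_kappa([0, 0, 1, 1], [0, 1, 0, 1]) == 0.0,
}
for name, passed in checks.items():
    print(name, "ok" if passed else "FAILED")
```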