Agreement Metrics for LLM-as-Judge Evaluation: What to Report and Why
Signal: 78 · Hype: 15
In three lines

A study of agreement metrics for LLM-as-Judge evaluation. An analysis of 24 recent papers shows that for binary criteria (MET/UNMET), Pearson r, Spearman ρ, Kendall τ_b, and phi are redundant, since on two-valued data they reduce to the same number. Of the commonly reported metrics, only Cohen's κ adds information. The authors propose a reporting checklist that includes the judgment scale, abstention handling, and the confusion matrix.
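The redundancy claim for binary criteria can be checked numerically. The following is a minimal sketch, not the authors' code: the labels are made-up illustrative data (1 = MET, 0 = UNMET), and every function name here is hypothetical. It computes the four correlation-style metrics from scratch alongside Cohen's κ and the confusion matrix.

```python
from itertools import combinations
from math import isclose, sqrt

def confusion(human, judge):
    """2x2 counts (a, b, c, d): a = both MET, b = human MET only,
    c = judge MET only, d = both UNMET."""
    pairs = list(zip(human, judge))
    return (pairs.count((1, 1)), pairs.count((1, 0)),
            pairs.count((0, 1)), pairs.count((0, 0)))

def phi_coefficient(a, b, c, d):
    return (a * d - b * c) / sqrt((a + b) * (c + d) * (a + c) * (b + d))

def cohen_kappa(a, b, c, d):
    n = a + b + c + d
    observed = (a + d) / n
    # Chance agreement from the two raters' marginal rates.
    expected = ((a + b) * (a + c) + (c + d) * (b + d)) / n**2
    return (observed - expected) / (1 - expected)

def pearson(x, y):
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    var_x = sum((xi - mx) ** 2 for xi in x)
    var_y = sum((yi - my) ** 2 for yi in y)
    return cov / sqrt(var_x * var_y)

def average_ranks(values):
    """Ranks starting at 1, with ties sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks

def spearman(x, y):
    # Spearman's rho is Pearson's r computed on the ranks.
    return pearson(average_ranks(x), average_ranks(y))

def kendall_tau_b(x, y):
    conc = disc = tied_x_only = tied_y_only = 0
    for (x1, y1), (x2, y2) in combinations(list(zip(x, y)), 2):
        dx, dy = x1 - x2, y1 - y2
        if dx == 0 and dy == 0:
            continue  # tied on both raters: excluded from tau_b entirely
        if dx == 0:
            tied_x_only += 1
        elif dy == 0:
            tied_y_only += 1
        elif dx * dy > 0:
            conc += 1
        else:
            disc += 1
    return (conc - disc) / sqrt((conc + disc + tied_x_only)
                                * (conc + disc + tied_y_only))

# Made-up example: 100 items, with a judge that says MET more often than the human.
human = [1] * 40 + [1] * 5 + [0] * 25 + [0] * 30
judge = [1] * 40 + [0] * 5 + [1] * 25 + [0] * 30

a, b, c, d = confusion(human, judge)
phi_value = phi_coefficient(a, b, c, d)
r_value = pearson(human, judge)
rho_value = spearman(human, judge)
tau_value = kendall_tau_b(human, judge)
kappa_value = cohen_kappa(a, b, c, d)

print("confusion matrix (a, b, c, d):", (a, b, c, d))
print("phi, r, rho, tau_b:", phi_value, r_value, rho_value, tau_value)
print("kappa:", kappa_value)  # phi is about 0.453, kappa about 0.417
```

On this example the four correlation-style metrics agree to floating-point precision, while κ comes out lower because the two raters' MET rates differ. Printing the confusion matrix alongside lets a reader recompute any of these values.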
Summary generated by Claude — human-verified