arXiv cs.CL·2 June 2026

Agreement Metrics for LLM-as-Judge Evaluation: What to Report and Why

In three lines: A study of agreement metrics for LLM-as-Judge evaluation. An analysis of 24 recent papers shows that for binary criteria (MET/UNMET), Pearson r, Spearman ρ, Kendall τ_b, and phi are redundant; only Cohen's κ adds information. The authors propose a reporting checklist that includes the judgment scale, abstention handling, and the confusion matrix.
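The redundancy claim can be checked by hand on a small example. The sketch below is not taken from the paper: the 20 labels are made-up toy data, the function names are hypothetical, and MET/UNMET is assumed to be coded as 1/0. It computes the four correlation coefficients and Cohen's κ from scratch using only the standard library.

```python
from itertools import combinations
from math import sqrt

def confusion(human, judge):
    """2x2 counts (a, b, c, d); labels are assumed coded 1 = MET, 0 = UNMET."""
    pairs = list(zip(human, judge))
    a = pairs.count((1, 1))  # both MET
    b = pairs.count((1, 0))  # human MET, judge UNMET
    c = pairs.count((0, 1))  # human UNMET, judge MET
    d = pairs.count((0, 0))  # both UNMET
    return a, b, c, d

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    vx = sum((xi - mx) ** 2 for xi in x)
    vy = sum((yi - my) ** 2 for yi in y)
    return cov / sqrt(vx * vy)

def average_ranks(values):
    """Midranks: tied values share the mean of the ranks they occupy."""
    ranks, start = {}, 0
    for v in sorted(set(values)):
        count = values.count(v)
        ranks[v] = start + (count + 1) / 2
        start += count
    return [ranks[v] for v in values]

def spearman(x, y):
    # Spearman's rho is Pearson's r computed on the midranks.
    return pearson(average_ranks(x), average_ranks(y))

def kendall_tau_b(x, y):
    conc = disc = tied_x = tied_y = 0
    for (x1, y1), (x2, y2) in combinations(list(zip(x, y)), 2):
        dx, dy = x1 - x2, y1 - y2
        if dx == 0 and dy == 0:
            continue            # tied on both variables: ignored by tau-b
        elif dx == 0:
            tied_x += 1         # tied on x only
        elif dy == 0:
            tied_y += 1         # tied on y only
        elif dx * dy > 0:
            conc += 1
        else:
            disc += 1
    return (conc - disc) / sqrt((conc + disc + tied_x) * (conc + disc + tied_y))

def phi(a, b, c, d):
    return (a * d - b * c) / sqrt((a + b) * (c + d) * (a + c) * (b + d))

def cohen_kappa(a, b, c, d):
    n = a + b + c + d
    p_obs = (a + d) / n
    p_exp = ((a + b) * (a + c) + (c + d) * (b + d)) / n ** 2
    return (p_obs - p_exp) / (1 - p_exp)

# Toy data (invented for illustration): the judge says MET more often than the human.
human = [1] * 10 + [0] * 10
judge = [1] * 9 + [0] * 1 + [1] * 5 + [0] * 5

counts = confusion(human, judge)
results = {
    "pearson_r": pearson(human, judge),
    "spearman_rho": spearman(human, judge),
    "kendall_tau_b": kendall_tau_b(human, judge),
    "phi": phi(*counts),
    "cohen_kappa": cohen_kappa(*counts),
}
for name, value in results.items():
    print(f"{name:>14}: {value:.4f}")
```

On this toy data the four correlation coefficients coincide, while κ comes out lower because the judge's MET rate (14 of 20) differs from the human's (10 of 20). That sensitivity to mismatched marginals is one way κ can carry information the correlations do not.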

Evals Benchmarks Papers

Summary generated by Claude — human-verified
