arXiv cs.CL · 15 June 2026

The Coin Flip Judge? Reliability and Bias in LLM-as-a-Judge Evaluation

In three lines: A reliability study of LLM-as-a-Judge evaluation finds that GPT-4o-mini and GPT-4.1-mini are significantly unstable as judges, with an average preference flip rate of 13.6% and 28% of questions exceeding a 20% flip rate. The study also detects position bias (72% A-majority) and reports cross-judge agreement of 76% (κ = 0.51). Reaching 95% confidence requires 11 repeated trials.
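To make the reported metrics concrete, here is a minimal Python sketch of how a per-question flip rate and Cohen's κ could be computed from judge verdicts. The summary does not give the paper's exact definitions, so this assumes that a "flip" is a repeated verdict that disagrees with the majority verdict for that question. The function names (`flip_rate`, `cohen_kappa`) and the toy verdicts are hypothetical illustrations, not data or code from the paper.

```python
from collections import Counter


def flip_rate(verdicts):
    """Fraction of repeated verdicts that disagree with the majority verdict.

    Assumed definition; the paper may define a flip differently.
    """
    counts = Counter(verdicts)
    majority_count = counts.most_common(1)[0][1]
    return 1 - majority_count / len(verdicts)


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two judges labelling the same items."""
    n = len(labels_a)
    # Observed agreement: share of items where both judges give the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each judge's marginal label frequencies.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    categories = set(freq_a) | set(freq_b)
    expected = sum(freq_a[c] * freq_b[c] for c in categories) / (n * n)
    # Undefined when expected == 1 (both judges always give the same single label).
    return (observed - expected) / (1 - expected)


# Hypothetical toy data: five repeated verdicts per question from one judge.
repeated_trials = {
    "q1": ["A", "A", "B", "A", "A"],
    "q2": ["B", "B", "B", "B", "B"],
}
per_question = {q: flip_rate(v) for q, v in repeated_trials.items()}
mean_flip_rate = sum(per_question.values()) / len(per_question)

# Hypothetical toy data: one verdict per question from each of two judges.
judge_1 = ["A", "A", "B", "B"]
judge_2 = ["A", "B", "B", "B"]
kappa = cohen_kappa(judge_1, judge_2)

print(per_question, mean_flip_rate, kappa)
```

Raw agreement and κ can diverge because κ subtracts the agreement expected by chance, which is why a 76% agreement rate can correspond to a κ of about 0.51.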

Evals GPT OpenAI Benchmarks AI safety

Summary generated by Claude — human-verified
