arXiv cs.CL

The Coin Flip Judge? Reliability and Bias in LLM-as-a-Judge Evaluation

Signal
82
Hype
15
In three lines
A reliability study of LLM-as-a-Judge evaluation finds that GPT-4o-mini and GPT-4.1-mini are significantly unstable as judges: the average preference flip rate is 13.6%, and 28% of questions exceed a 20% flip rate. The study also detects position bias (72% A-majority). Cross-judge agreement is 76% (κ=0.51), and 11 repeated trials are needed to reach 95% confidence.
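As a rough illustration of the two headline metrics, the sketch below computes a per-question flip rate and Cohen's κ between two judges. It is a minimal example under stated assumptions, not the paper's code: the summary does not say how a "flip" is counted, so here it is assumed to be any trial that disagrees with the majority verdict for that question. The function names and the toy verdict lists are hypothetical.

```python
from collections import Counter


def flip_rate(verdicts):
    """Share of repeated trials that disagree with the majority verdict.

    Assumption: a "flip" is any trial whose verdict differs from the
    most common verdict for the same question.
    """
    counts = Counter(verdicts)
    majority_count = counts.most_common(1)[0][1]
    return 1 - majority_count / len(verdicts)


def cohen_kappa(labels_1, labels_2):
    """Cohen's kappa: agreement between two judges, corrected for chance."""
    n = len(labels_1)
    # Observed agreement: fraction of items with identical verdicts.
    observed = sum(a == b for a, b in zip(labels_1, labels_2)) / n
    # Chance agreement from each judge's marginal label frequencies.
    c1, c2 = Counter(labels_1), Counter(labels_2)
    expected = sum(c1[k] * c2[k] for k in set(c1) | set(c2)) / (n * n)
    # Undefined when expected == 1 (both judges always give one label).
    return (observed - expected) / (1 - expected)


# Toy data, made up for illustration only (not from the study).
trials = ["A", "A", "B", "A", "A", "A", "B", "A", "A", "A"]
judge_1 = ["A", "A", "B", "B", "A", "B", "A", "A"]
judge_2 = ["A", "B", "B", "B", "A", "A", "A", "A"]

rate = flip_rate(trials)
kappa = cohen_kappa(judge_1, judge_2)
print(f"flip rate: {rate:.2f}, kappa: {kappa:.2f}")
```

Because κ subtracts the agreement expected by chance, a raw agreement of 76% can correspond to a κ of only 0.51 when both judges lean toward the same answer.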
Evals, GPT, OpenAI, Benchmarks, AI safety

Summary generated by Claude — human-verified