The Coin Flip Judge? Reliability and Bias in LLM-as-a-Judge Evaluation
Signal: 82 · Hype: 15
In three lines
A reliability study of LLM-as-a-Judge finds that GPT-4o-mini and GPT-4.1-mini are significantly unstable: preferences flip 13.6% of the time on average, and 28% of questions exceed a 20% flip rate. Position bias is detected (72% A-majority), and cross-judge agreement is 76% (κ=0.51). The study estimates that 11 repeated trials are needed for 95% confidence.
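The two headline metrics can be illustrated with a minimal Python sketch. The summary does not say how the study defines a flip, so `flip_rate` below assumes one plausible reading: the share of repeated verdicts on a question that disagree with the majority verdict. `cohen_kappa` is the standard two-rater formula. Both function names and the toy verdicts are hypothetical and do not come from the paper.

```python
from collections import Counter


def flip_rate(verdicts):
    """Share of repeated verdicts that differ from the majority verdict.

    Assumed definition for illustration; the study may define flips differently.
    """
    majority_count = Counter(verdicts).most_common(1)[0][1]
    return (len(verdicts) - majority_count) / len(verdicts)


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two judges labelling the same items."""
    n = len(labels_a)
    # Observed agreement: fraction of items where both judges agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement from each judge's marginal label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1:
        return 1.0  # both judges always emit the same single label
    return (observed - expected) / (1 - expected)


# Toy data (not from the study): five repeated trials on one question,
# and two judges' verdicts on four questions.
trials = ["A", "A", "B", "A", "A"]
judge_1 = ["A", "A", "B", "B"]
judge_2 = ["A", "B", "B", "B"]

print(flip_rate(trials))
print(cohen_kappa(judge_1, judge_2))
```

In the toy data, one of five trials disagrees with the majority, and the two judges agree on three of four items against a chance agreement of 0.5. Raw agreement overstates reliability whenever chance agreement is high, which is why the summary reports κ alongside the 76% figure.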
Summary generated by Claude — human-verified