Auditing LLM Benchmarks with Item Response Theory
Signal: 78
Hype: 15
In three lines: An Item Response Theory (IRT)-based method detects mislabeled items in 7 LLM benchmarks, reaching 95% precision on the top 200 examples across 114 models. The analysis traces the errors to mechanical labeling heuristics, inherited annotation mistakes, and fundamentally ambiguous items. Reward models specialize in stylistic preference over factual knowledge; one frontier model agrees with detected mislabels at 78% accuracy versus 38% for peers.
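The summary does not say how the method scores items, so the following is only a minimal sketch of one common IRT-style heuristic, not the source's procedure. It assumes a two-parameter logistic (2PL) model, uses each model's standardized mean accuracy as a stand-in for ability, and flags the items whose fitted discrimination is most negative, that is, items that stronger models tend to get "wrong". The function names (`simulate`, `fit_item_slopes`, `flag_suspects`) and all simulated data are hypothetical.

```python
import numpy as np


def simulate(n_models=114, n_items=300, n_bad=15, seed=0):
    """Simulate 0/1 responses under a 2PL model and flip the labels of n_bad items."""
    rng = np.random.default_rng(seed)
    theta = rng.normal(0.0, 1.0, size=n_models)        # model abilities
    a = rng.uniform(1.0, 2.5, size=n_items)            # item discriminations
    d = rng.uniform(-1.5, 1.5, size=n_items)           # item difficulties
    p = 1.0 / (1.0 + np.exp(-a * (theta[:, None] - d)))
    responses = (rng.random((n_models, n_items)) < p).astype(float)
    bad = rng.choice(n_items, size=n_bad, replace=False)
    # A wrong gold label inverts which answers are scored as correct.
    responses[:, bad] = 1.0 - responses[:, bad]
    return responses, bad


def fit_item_slopes(responses, n_iter=200, lr=0.5):
    """Per-item logistic fit of correctness on an ability proxy.

    Simplification (assumption): ability is approximated by standardized
    mean accuracy instead of being estimated jointly with item parameters.
    """
    acc = responses.mean(axis=1)
    theta = (acc - acc.mean()) / (acc.std() + 1e-9)
    n_items = responses.shape[1]
    slope = np.zeros(n_items)
    intercept = np.zeros(n_items)
    for _ in range(n_iter):
        logits = np.outer(theta, slope) + intercept
        p = 1.0 / (1.0 + np.exp(-logits))
        err = responses - p                              # log-likelihood gradient term
        slope += lr * (theta[:, None] * err).mean(axis=0)
        intercept += lr * err.mean(axis=0)
    return slope, intercept


def flag_suspects(responses, top_k):
    """Return indices of the top_k items with the lowest fitted discrimination."""
    slope, _ = fit_item_slopes(responses)
    return np.argsort(slope)[:top_k]


if __name__ == "__main__":
    responses, bad = simulate()
    flagged = flag_suspects(responses, top_k=len(bad))
    precision = len(set(flagged) & set(bad)) / len(flagged)
    print(f"precision on top {len(flagged)} flagged items: {precision:.2f}")
```

On real benchmark data the flagged items would still need manual review, since the summary notes that some of them are fundamentally ambiguous rather than simply mislabeled.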
Summary generated by Claude — human-verified