arXiv cs.CL

Auditing LLM Benchmarks with Item Response Theory

Signal
78
Hype
15
In three lines
An Item Response Theory-based method detects mislabeled examples in 7 LLM benchmarks, reaching 95% precision on the top 200 examples across 114 models. The analysis traces the errors to mechanical labeling heuristics, inherited annotation mistakes, and fundamentally ambiguous items. Reward models specialize in stylistic preference over factual knowledge; one frontier model agrees with the detected mislabels at 78% accuracy, versus 38% for its peers.
Benchmarks, Evals, Papers

Summary generated by Claude — human-verified