arXiv cs.CL·1 June 2026

Auditing LLM Benchmarks with Item Response Theory


In three lines: An Item Response Theory-based method detects mislabeled items in 7 LLM benchmarks, reaching 95% precision on the top 200 examples across 114 models. The analysis traces the errors to mechanical labeling heuristics, inherited annotation mistakes, and fundamentally ambiguous items. Reward models specialize in stylistic preference over factual knowledge; one frontier model agrees with the detected mislabels at 78% accuracy versus 38% for its peers.
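To make the idea concrete, here is a minimal sketch of how model responses can point at suspect labels. This summary does not describe the paper's actual model or fitting procedure, so everything below is an assumption for illustration: `p_correct` is the standard two-parameter logistic (2PL) item response function, and `item_rest_correlation` is a classical-test-theory stand-in for IRT discrimination, not a fitted IRT parameter. The toy `RESPONSES` matrix and all function names are hypothetical.

```python
import math

def p_correct(theta, a, b):
    """Standard 2PL item response function: probability that a model with
    ability `theta` answers an item with discrimination `a` and
    difficulty `b` correctly."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def item_rest_correlation(responses, item):
    """Pearson correlation between correctness on `item` and each model's
    score on all other items. A negative value means stronger models tend
    to 'fail' the item, which is one symptom of a wrong gold label."""
    xs = [row[item] for row in responses]
    ys = [sum(row) - row[item] for row in responses]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0  # no variance, so the item carries no signal
    return cov / math.sqrt(vx * vy)

def rank_suspects(responses):
    """Return item indices ordered from most to least suspicious
    (lowest item-rest correlation first)."""
    n_items = len(responses[0])
    scored = [(item_rest_correlation(responses, j), j) for j in range(n_items)]
    return [j for _, j in sorted(scored)]

# Hypothetical toy data: rows are models (weakest to strongest), columns are
# items, 1 = graded correct. Item 4 is scored against a wrong label, so the
# strongest models are the ones marked incorrect on it.
RESPONSES = [
    [0, 0, 0, 0, 1],
    [1, 0, 0, 0, 1],
    [1, 1, 0, 0, 1],
    [1, 1, 1, 0, 0],
    [1, 1, 1, 1, 0],
    [1, 1, 1, 1, 0],
]

suspects = rank_suspects(RESPONSES)
```

In this toy matrix the inverted item is ranked first. A real audit would fit item parameters across many models and send only the top-ranked items for human review, which is where a precision figure like the one quoted above would be measured.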


Benchmarks Evals Papers

Summary generated by Claude — human-verified

