ProfBench: Multi-Domain Rubrics requiring Professional Knowledge to Answer and Judge
Signal: 82
Hype: 18
In three lines
ProfBench introduces a benchmark of 7000+ response-criterion pairs evaluated by domain experts (Physics/Chemistry PhDs, Finance/Consulting MBAs).
Even top models such as GPT-5-high score only 65.9%.
The authors also develop robust LLM-Judges that reduce evaluation costs by 2-3 orders of magnitude.
Summary generated by Claude — human-verified