arXiv cs.AI

ProfBench: Multi-Domain Rubrics requiring Professional Knowledge to Answer and Judge

Signal: 82 · Hype: 18
In three lines: ProfBench introduces a benchmark of 7000+ response-criterion pairs evaluated by domain experts (Physics/Chemistry PhDs, Finance/Consulting MBAs). Even top models such as GPT-5-high score only 65.9%. The authors develop robust LLM-Judges that reduce evaluation costs by 2-3 orders of magnitude.
Benchmarks · Evals · GPT

Summary generated by Claude — human-verified