arXiv cs.AI·19 May 2026

ProfBench: Multi-Domain Rubrics requiring Professional Knowledge to Answer and Judge

In three lines: ProfBench introduces a benchmark of 7000+ response-criterion pairs evaluated by domain experts (Physics/Chemistry PhDs, Finance/Consulting MBAs). Even top models such as GPT-5-high reach only 65.9% performance. The authors also develop robust LLM-Judges that reduce evaluation costs by 2-3 orders of magnitude.

Benchmarks Evals GPT

Summary generated by Claude — human-verified
