Soohak: A Mathematician-Curated Benchmark for Evaluating Research-level Math Capabilities of LLMs
Signal: 82
Hype: 18
In three lines: Soohak is a 439-problem research-level math benchmark authored by 64 mathematicians. Gemini-3-Pro reaches 30.4%, GPT-5 26.4%, Claude-Opus-4.5 10.4%. The benchmark introduces a refusal subset evaluating the ability to recognize ill-posed problems: no model exceeds 50%.
Summary generated by Claude — human-verified