arXiv cs.CL

Soohak: A Mathematician-Curated Benchmark for Evaluating Research-level Math Capabilities of LLMs

In three lines: Soohak is a 439-problem research-level math benchmark authored by 64 mathematicians. Gemini-3-Pro reaches 30.4%, GPT-5 26.4%, and Claude-Opus-4.5 10.4%. The benchmark also introduces a refusal subset that tests whether models can recognize ill-posed problems; no model exceeds 50% on it.
Benchmarks, Reasoning, GPT, Gemini, Claude

Summary generated by Claude — human-verified