arXiv cs.CL · 19 May 2026

Soohak: A Mathematician-Curated Benchmark for Evaluating Research-level Math Capabilities of LLMs

In three lines: Soohak is a 439-problem research-level math benchmark authored by 64 mathematicians. Gemini-3-Pro reaches 30.4%, GPT-5 26.4%, and Claude-Opus-4.5 10.4%. The benchmark also introduces a refusal subset that tests whether models can recognize ill-posed problems; no model exceeds 50% on it.

Benchmarks Reasoning GPT Gemini Claude

Summary generated by Claude — human-verified
