arXiv cs.CL · 20 May 2026

Mathematical Reasoning in Large Language Models: Benchmarks, Architectures, Evaluation, and Open Challenges

In three lines: Survey of ~120 studies on mathematical reasoning in LLMs. Structured analysis of datasets, architectures, training strategies, and evaluation protocols. Identifies recurring failure modes: reasoning faithfulness, benchmark biases, generalization limitations.

Reasoning · Benchmarks · Evals · Fine-tuning

Summary generated by Claude — human-verified
