arXiv cs.AI·27 May 2026

Reasoning, Code, or Both? How Large Language Models Handle Variations in Math Questions


In three lines: A comparative study of three LLM approaches on 1,000 math problems (GSM-Symbolic): chain-of-thought (CoT), Program-Aided Language models (PAL), and Step-by-Step Coding (SBSC). CoT proves more robust to question variations, with a 1.3 percentage-point drop versus 1.7 for PAL, which contradicts the hypothesis that code execution improves reasoning robustness.
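To make the "pp" figures concrete, here is a minimal sketch of how a percentage-point drop is computed from two accuracy measurements. The correct-answer counts are hypothetical, chosen only so the arithmetic lands on the reported 1.3 and 1.7; they are not the paper's raw numbers, and the function names are illustrative.

```python
def accuracy(correct: int, total: int) -> float:
    """Fraction of problems answered correctly."""
    if total <= 0:
        raise ValueError("total must be positive")
    return correct / total


def pp_drop(acc_original: float, acc_variant: float) -> float:
    """Accuracy drop in percentage points (an absolute difference, not a relative percent)."""
    return (acc_original - acc_variant) * 100.0


# Hypothetical counts on a 1,000-problem set, for illustration only.
TOTAL = 1000
cot_drop = pp_drop(accuracy(900, TOTAL), accuracy(887, TOTAL))
pal_drop = pp_drop(accuracy(900, TOTAL), accuracy(883, TOTAL))

print(f"CoT drop: {cot_drop:.1f}pp, PAL drop: {pal_drop:.1f}pp")
```

Under these assumptions, on a set of 1,000 problems a 1.3pp drop corresponds to 13 fewer correct answers and a 1.7pp drop to 17, so the gap between the two methods is small in absolute terms.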


Reasoning Code generation Benchmarks Claude

Summary generated by Claude — human-verified
