arXiv cs.AI

Reasoning, Code, or Both? How Large Language Models Handle Variations in Math Questions

Signal: 72
Hype: 15
In three lines: A comparative study of three LLM approaches on 1,000 math problems from GSM-Symbolic: chain-of-thought (CoT), Program-Aided Language models (PAL), and Step-by-Step Coding (SBSC). CoT proves more robust to variations in the questions (a 1.3 pp drop versus 1.7 pp for PAL), contradicting the hypothesis that code execution improves reasoning robustness.
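
To make the "pp drop" figures concrete, here is a minimal sketch. It assumes the drop is measured as accuracy on the original problems minus accuracy on the varied problems, expressed in percentage points; the summary does not spell out the exact metric. The function names and the counts are hypothetical and are not taken from the paper.

```python
def accuracy(correct: int, total: int) -> float:
    """Fraction of problems answered correctly."""
    if total <= 0:
        raise ValueError("total must be positive")
    return correct / total


def drop_in_pp(acc_original: float, acc_variant: float) -> float:
    """Accuracy lost when moving from original to varied problems, in percentage points."""
    return (acc_original - acc_variant) * 100.0


# Illustrative counts only, NOT results from the paper:
# 900/1000 correct on original problems, 880/1000 on varied ones.
example_drop = drop_in_pp(accuracy(900, 1000), accuracy(880, 1000))
```

A percentage-point drop is an absolute difference between two accuracies, so a 1.3 pp drop and a 1.7 pp drop can be compared directly even if the two methods start from different baseline accuracies.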
Reasoning · Code generation · Benchmarks · Claude

Summary generated by Claude — human-verified