arXiv cs.CL·19 May 2026

ToolMATH: A Diagnostic Benchmark for Long-Horizon Tool Use under Systematic Tool-Catalog Constraints

Signal

Hype

In three linesToolMATH is a diagnostic benchmark for evaluating long-horizon tool use by language models. It converts math solutions into reusable Python tools with natural-language descriptions and typed schemas, then measures adaptability (success with replacement tools), robustness (stability under distractors), and tool connectivity (accuracy over long chains).

Read source

Your take?

Benchmarks AI Agents Tools Reasoning

Summary generated by Claude — human-verified

ToolMATH: A Diagnostic Benchmark for Long-Horizon Tool Use under Systematic Tool-Catalog Constraints

Other angles on this story