ToolMATH: A Diagnostic Benchmark for Long-Horizon Tool Use under Systematic Tool-Catalog Constraints
Signal
72
Hype
18
In three linesToolMATH is a diagnostic benchmark for evaluating long-horizon tool use by language models. It converts math solutions into reusable Python tools with natural-language descriptions and typed schemas, then measures adaptability (success with replacement tools), robustness (stability under distractors), and tool connectivity (accuracy over long chains).Read source
Your take?
Summary generated by Claude — human-verified