Back to feed
arXiv cs.CL·

ToolMATH: A Diagnostic Benchmark for Long-Horizon Tool Use under Systematic Tool-Catalog Constraints

Signal
72
Hype
18
In three linesToolMATH is a diagnostic benchmark for evaluating long-horizon tool use by language models. It converts math solutions into reusable Python tools with natural-language descriptions and typed schemas, then measures adaptability (success with replacement tools), robustness (stability under distractors), and tool connectivity (accuracy over long chains).
Read source
Your take?
BenchmarksAI AgentsToolsReasoning

Summary generated by Claude — human-verified