TOBench: A Task-Oriented Omni-Modal Benchmark for Real-World Tool-Using Agents
Signal: 82
Hype: 25
In three lines
TOBench is a benchmark for omni-modal tool-using agents in real-world workflows. It comprises 100 executable tasks (customer service, intelligent creation), 27 MCP servers, and 324 tools. Verification is closed-loop and multimodal: agents execute, inspect, and self-correct. Claude Opus 4.6 achieves 32% success versus a 94% human baseline.
Summary generated by Claude — human-verified