arXiv cs.AI

TOBench: A Task-Oriented Omni-Modal Benchmark for Real-World Tool-Using Agents

Signal: 82 · Hype: 25
In three lines
TOBench is a benchmark for omni-modal tool-using agents in real-world workflows. It comprises 100 executable tasks (customer service, intelligent creation), 27 MCP servers, and 324 tools. Verification is closed-loop and multimodal: agents execute, inspect, and self-correct. Claude Opus 4.6 achieves a 32% success rate against a 94% human baseline.
AI Agents · MCP · Benchmarks · Claude · Vision

Summary generated by Claude — human-verified