arXiv cs.AI·19 May 2026

TOBench: A Task-Oriented Omni-Modal Benchmark for Real-World Tool-Using Agents

In three lines: TOBench is a benchmark for omni-modal tool-using agents in real-world workflows. It comprises 100 executable tasks (customer service, intelligent creation) with 27 MCP servers and 324 tools. Verification is closed-loop and multimodal: agents execute, inspect, and self-correct. Claude Opus 4.6 achieves a 32% success rate against a 94% human baseline.

Tags: AI Agents, MCP, Benchmarks, Claude, Vision

Summary generated by Claude — human-verified
