arXiv cs.AI·22 May 2026

AgentAtlas: Beyond Outcome Leaderboards for LLM Agents

In three lines: AgentAtlas proposes multidimensional evaluation of LLM agents beyond simple success rates. The study introduces a 6-state control taxonomy and a 9-category error taxonomy, and audits 15 existing benchmarks. On 8 models (4 closed, 4 open-weight), removing explicit labels drops accuracy by 14-40 percentage points, revealing strong prompt dependency.

AI Agents Benchmarks Evals

Summary generated by Claude — human-verified
