arXiv cs.AI

AgentAtlas: Beyond Outcome Leaderboards for LLM Agents

Signal: 78
Hype: 15
In three lines
AgentAtlas proposes multidimensional evaluation of LLM agents beyond simple success rates. The study introduces a 6-state control taxonomy and a 9-category error taxonomy, and audits 15 existing benchmarks. On 8 models (4 closed, 4 open-weight), removing explicit labels drops accuracy by 14-40 pp, revealing strong prompt dependency.
AI Agents, Benchmarks, Evals

Summary generated by Claude — human-verified