arXiv cs.AI

EComAgentBench: Benchmarking Shopping Agents on Long-Horizon Tasks with Distributed Hidden Intent

Signal: 82 · Hype: 15
In three lines
EComAgentBench is a benchmark of 662 e-commerce tasks that evaluates LLM-based shopping agents on hidden intent distributed across the query, the user profile, and clarifications. These requirements are scattered, and agents must uncover them within 100 tool calls. The strongest model achieves only 57.1% accuracy.
AI Agents · Benchmarks · Evals

Summary generated by Claude — human-verified