arXiv cs.LG

LongDS-Bench: On the Failure of Long-Horizon Agentic Data Analysis

Signal: 82
Hype: 15
In three lines: LongDS-Bench evaluates AI agents' ability to maintain analytical context over long horizons. The benchmark contains 68 multi-turn data analysis tasks (2,225 turns) from real Kaggle notebooks. Best models reach only 48.45% accuracy, with a 47-point performance drop from early to late turns. Long-horizon errors account for 52–69% of failures.
AI Agents, Benchmarks, Evals, Reasoning

Summary generated by Claude — human-verified