LongDS-Bench: On the Failure of Long-Horizon Agentic Data Analysis
Signal: 82
Hype: 15
In three lines
LongDS-Bench evaluates AI agents' ability to maintain analytical context over long horizons. The benchmark contains 68 multi-turn data analysis tasks (2,225 turns) from real Kaggle notebooks. Best models reach only 48.45% accuracy, with a 47-point performance drop from early to late turns. Long-horizon errors account for 52–69% of failures.
Summary generated by Claude — human-verified