arXiv cs.LG · 1 June 2026

LongDS-Bench: On the Failure of Long-Horizon Agentic Data Analysis


In three lines: LongDS-Bench evaluates AI agents' ability to maintain analytical context over long horizons. The benchmark contains 68 multi-turn data analysis tasks (2,225 turns) drawn from real Kaggle notebooks. The best models reach only 48.45% accuracy, with a 47-point performance drop from early to late turns. Long-horizon errors account for 52–69% of failures.


AI Agents Benchmarks Evals Reasoning

Summary generated by Claude — human-verified
