arXiv cs.AI

CODA-BENCH: Can Code Agents Handle Data-Intensive Tasks?

In three lines: CODA-BENCH is the first benchmark jointly evaluating code and data intelligence in AI agents. Built on the Kaggle ecosystem with 1,009 tasks and ~980 files per environment, it reveals that top agents achieve only a 61.1% success rate when integrating data discovery with code execution.

## CODA-BENCH: Exposing the Real Ceiling of Data-Intensive AI Agents

### 1. What's being measured — and why it was missing

Existing code agent benchmarks (SWE-bench, HumanEval, DS-1000) evaluate either pure code generation or structured data manipulation, never both simultaneously in a large-scale, noisy environment. Yet a large share of real data work (60–70% by a common practitioner estimate) goes into *finding* the right files before a single line of code is written. CODA-BENCH (arXiv:2606.15300) closes this gap by imposing a dual constraint on agents: complex filesystem exploration combined with analytical code generation.

The benchmark is built on the Kaggle ecosystem — hundreds of public datasets organized across 31 thematic communities. Each task environment contains an average of **980 files**, simulating the noise and density of a real data project. The 1,009 tasks cover realistic analytical scenarios: aggregations, multi-file joins, conditional transformations, and visualizations.

### 2. The numbers that matter

The headline result: **61.1% success rate for the best-performing agent tested**. This ceiling is striking when set against isolated benchmarks, where the same models score roughly 80–90% or higher on pure code tasks (GPT-4o on HumanEval: ~90%). The resulting gap of roughly 19 to 29 percentage points is not an artifact of added algorithmic difficulty: it points to a structural inability of current agents to orchestrate resource discovery and code execution in a coherent pipeline.
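As a quick arithmetic check on the comparison above, the sketch below computes the gap in percentage points between the 61.1% CODA-BENCH result and the 80–90% range quoted for isolated code benchmarks. The constant and function names are illustrative, and the scores come from different benchmarks, so the result is indicative only.

```python
# Figures quoted in the summary; the 80% and 90% bounds are its rough range
# for isolated code benchmarks, not values measured here.
CODA_BENCH_BEST = 61.1
ISOLATED_RANGE = (80.0, 90.0)


def gap_points(isolated: float, integrated: float) -> float:
    """Absolute gap in percentage points, rounded to one decimal."""
    return round(isolated - integrated, 1)


gaps = [gap_points(score, CODA_BENCH_BEST) for score in ISOLATED_RANGE]
print(gaps)
```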

The 31 communities enable domain-level analysis: certain verticals (finance, biology) feature deeper file hierarchies and less standardized schemas, further degrading performance. The benchmark explicitly distinguishes *data discovery* errors (wrong file selected) from *code execution* errors (right file, wrong processing) — a methodologically valuable diagnostic split.
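The diagnostic split described above can be pictured with a small classifier. This is a minimal sketch under stated assumptions, not the benchmark's actual scoring code: the `TaskOutcome` record, its field names, and the rule "a failure is a discovery error whenever a required file was not used" are all hypothetical.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskOutcome:
    """Hypothetical record of one agent run; field names are illustrative."""
    task_id: str
    files_used: frozenset
    files_required: frozenset
    answer_correct: bool


def classify(outcome: TaskOutcome) -> str:
    """Label a run as success, data discovery error, or code execution error."""
    if outcome.answer_correct:
        return "success"
    # Assumption: missing any required file counts as a discovery failure.
    if not outcome.files_required <= outcome.files_used:
        return "data_discovery_error"
    # Right files, wrong result: the processing itself went wrong.
    return "code_execution_error"


def tally(outcomes) -> Counter:
    """Count outcomes per label across a batch of runs."""
    return Counter(classify(o) for o in outcomes)
```

Separating the two labels this way lets a failure rate be attributed to navigation or to processing, which is the distinction the paragraph above highlights.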

### 3. Why the gap exists — structural analysis

Three mechanisms explain the 61.1% ceiling:

**a) Cost of unguided exploration.** With ~980 files per environment, an agent lacking an efficient indexing strategy burns a significant fraction of its context window on navigation. Current agents tend to either over-explore (context overflow) or under-explore (first plausible file retained without verification).
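To make the idea of an indexing strategy in (a) concrete, here is a minimal sketch, assuming an environment made largely of CSV files. It walks the tree once, records only paths, sizes, and CSV header rows, and ranks candidates by keyword overlap. The function names and the scoring rule are illustrative choices, not something the benchmark prescribes.

```python
import csv
import os


def build_index(root: str, max_header_chars: int = 4096) -> list:
    """Walk the tree once and keep a compact entry per file."""
    index = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in sorted(filenames):
            path = os.path.join(dirpath, name)
            entry = {
                "path": os.path.relpath(path, root),
                "size": os.path.getsize(path),
                "columns": [],
            }
            if name.lower().endswith(".csv"):
                # Read only the first line so large files stay cheap to index.
                with open(path, newline="", encoding="utf-8", errors="replace") as fh:
                    first_line = fh.readline(max_header_chars)
                entry["columns"] = next(csv.reader([first_line]), [])
            index.append(entry)
    return index


def rank_candidates(index: list, keywords: list, top_k: int = 5) -> list:
    """Return up to top_k paths whose name or header mentions the keywords."""
    wanted = {k.lower() for k in keywords}

    def score(entry: dict) -> int:
        haystack = (entry["path"] + " " + " ".join(entry["columns"])).lower()
        return sum(1 for k in wanted if k in haystack)

    scored = [(score(e), e["path"]) for e in index]
    scored = [item for item in scored if item[0] > 0]
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [path for _score, path in scored[:top_k]]
```

An agent could place the ranked shortlist in its context instead of raw directory listings, which is one way to avoid both over-exploring and settling on the first plausible file.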

**b) Absence of persistent episodic memory.** LLM agents without external memory cannot build a mental map of the filesystem across tool calls. Each sub-task restarts from scratch, multiplying consistency errors between steps.
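The kind of external memory that (b) says is missing could be as simple as notes persisted to disk between tool calls. The sketch below is hypothetical: the `FilesystemMemory` class, its methods, and the JSON storage format are assumptions made for illustration.

```python
import json
from pathlib import Path


class FilesystemMemory:
    """Hypothetical JSON-backed notes an agent keeps between tool calls."""

    def __init__(self, store):
        self.store = Path(store)
        # Reload earlier notes so a new sub-task does not start from scratch.
        if self.store.exists():
            self.notes = json.loads(self.store.read_text(encoding="utf-8"))
        else:
            self.notes = {}

    def remember(self, path: str, note: str) -> None:
        """Record what was learned about a file and persist it immediately."""
        self.notes[path] = note
        self.store.write_text(
            json.dumps(self.notes, indent=2, sort_keys=True), encoding="utf-8"
        )

    def recall(self, keyword: str) -> dict:
        """Return notes whose path or text mentions the keyword."""
        key = keyword.lower()
        return {
            p: n for p, n in self.notes.items()
            if key in p.lower() or key in n.lower()
        }
```

Because every `remember` call writes through to disk, a later step can construct a fresh `FilesystemMemory` on the same file and query what earlier steps already found.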

**c) Misalignment between reward signal and optimal behavior.** Agents trained on pure code optimize for syntactically correct output, not for validating that input data actually matches the problem specification. This training bias manifests directly in data discovery failures.
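The validation step that (c) says agents skip can be sketched as a guard that checks the input schema before any analysis runs. This is a minimal illustration assuming CSV input with a header row; `missing_columns` and `run_if_valid` are hypothetical helper names.

```python
import csv
import io


def missing_columns(csv_text: str, required: set) -> set:
    """Return the required column names absent from the CSV header."""
    header = next(csv.reader(io.StringIO(csv_text)), [])
    present = {c.strip().lower() for c in header}
    return {c for c in required if c.lower() not in present}


def run_if_valid(csv_text: str, required: set, analysis):
    """Run the analysis only when the input matches the task's schema."""
    missing = missing_columns(csv_text, required)
    if missing:
        # Fail before any processing instead of returning a plausible wrong answer.
        raise ValueError(f"input lacks required columns: {sorted(missing)}")
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    return analysis(rows)


sample = "region,revenue\nEU,10\nUS,5\n"
total = run_if_valid(
    sample, {"region", "revenue"},
    lambda rows: sum(int(r["revenue"]) for r in rows),
)
print(total)
```

Raising on a schema mismatch turns a silent data discovery failure into an explicit signal the agent can act on, for example by returning to the file search.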

### 4. Losers and practical implications

**Direct losers:** Agent frameworks positioning themselves on data pipeline automation (AutoGPT-style systems, certain Copilot agents) have their real ceiling exposed. A 61.1% rate in a controlled environment suggests substantially lower performance in production, where filesystems are even less structured.

**Indirect losers:** Teams that have deployed autonomous agents on data analysis tasks without an appropriate validation benchmark. CODA-BENCH now provides a qualification tool that these deployments lacked.

**What the benchmark doesn't yet measure:** multi-agent collaborative tasks, streaming data environments, and scenarios with partially corrupted or poorly named files — dimensions that would further degrade scores.

For practitioners, CODA-BENCH establishes a new evaluation standard for any agent intended to operate on real data environments. The 61.1% score marks the ceiling of today's agents, but it should be read as a baseline for future ones: architectures integrating filesystem RAG, episodic memory, and schema validation before execution can be expected to improve on it. The benchmark is available via arXiv:2606.15300.

Tags: AI Agents, Benchmarks, Code generation, Evals

Summary generated by Claude — human-verified