EvoCode-Bench: Evaluating Coding Agents in Multi-Turn Iterative Interactions
**In three lines:** EvoCode-Bench evaluates 13 coding agents on 26 tasks of 5 to 15 iterative rounds each. Agents must maintain a working codebase as specifications change. Results: a 22 to 40 point gap between single-round (SR) and multi-turn (MT@4) scores, no agent above roughly 50% success on multi-turn metrics, and progressive degradation (aggregate pass rate falls below half of its round-1 level by round 5).
## EvoCode-Bench: coding agents collapse by round 5
### 1. What is being measured — and why it was missing
Nearly every code generation benchmark (HumanEval, MBPP, classic SWE-bench) evaluates an atomic task: one specification → one solution → one verdict. This protocol ignores the reality of software development, where requirements evolve and code produced at turn N must remain functional at turn N+5. EvoCode-Bench fills this gap with 26 stateful tasks and 227 evaluated rounds, each task spanning 5 to 15 iterative turns. The agent's workspace is preserved between turns — files written at turn 1 are still present at turn 8. Tests are cumulative: each new round checks new requirements *and* all previously active ones. It is an automated regression test built into the benchmark itself.
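The cumulative-test idea can be illustrated with a minimal sketch. It is not the benchmark's actual harness: the dict-based `Workspace`, the `run_cumulative_rounds` function, and the toy agent and checks are all hypothetical names chosen for illustration. The sketch shows only the two properties described above, namely a workspace that persists between rounds and checks that stay active once introduced.

```python
from typing import Callable, Dict, List

# Assumption: the workspace is modeled as filename -> contents.
# A real harness would operate on an actual file tree.
Workspace = Dict[str, str]
Check = Callable[[Workspace], bool]


def run_cumulative_rounds(
    workspace: Workspace,
    agent_step: Callable[[Workspace, int], None],
    checks_per_round: List[List[Check]],
) -> List[bool]:
    """Run all rounds on one persistent workspace; return pass/fail per round."""
    active_checks: List[Check] = []
    results: List[bool] = []
    for round_index, new_checks in enumerate(checks_per_round):
        agent_step(workspace, round_index)  # mutates the shared workspace
        active_checks.extend(new_checks)    # earlier checks remain active
        results.append(all(check(workspace) for check in active_checks))
    return results


def toy_agent(workspace: Workspace, round_index: int) -> None:
    """Hypothetical agent that breaks an earlier requirement in round 3."""
    if round_index == 0:
        workspace["app.py"] = "def greet(): return 'hi'"
    elif round_index == 1:
        workspace["util.py"] = "def add(a, b): return a + b"
    else:
        # Satisfies the new requirement but silently drops greet().
        workspace["app.py"] = "def farewell(): return 'bye'"


checks: List[List[Check]] = [
    [lambda ws: "def greet" in ws.get("app.py", "")],
    [lambda ws: "def add" in ws.get("util.py", "")],
    [lambda ws: "def farewell" in ws.get("app.py", "")],
]

results = run_cumulative_rounds({}, toy_agent, checks)
print(results)  # [True, True, False]
```

The third round fails even though its own new check passes, because the round-1 check on `greet` is still active. That is the regression signal a single-round protocol cannot see.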
### 2. The numbers that matter
Two metrics structure the analysis:

- **SR (Single-Round)**: score on a pre-completed reference state, equivalent to the classic paradigm.
- **MT@4**: multi-turn score with up to 4 attempts per round before fail-stop.

The SR − MT@4 gap ranges from **22 to 40 points** across agents. This is not an artifact of intrinsic task difficulty: it is the degradation caused by state accumulation and tracking of changing specifications.
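One plausible reading of a multi-turn score with fail-stop can be sketched as follows. The exact scoring formula is not given in this summary, so the function below is an assumption: it counts the share of rounds completed before the first round that exhausts its attempt budget. The names `multi_turn_score` and `sr_mt_gap`, the input encoding, and the toy values are all hypothetical.

```python
from typing import List


def multi_turn_score(attempts_needed: List[int], max_attempts: int = 4) -> float:
    """Percentage of rounds completed before fail-stop (assumed definition).

    attempts_needed[i] is how many attempts the agent needed to pass round i;
    0 means it never passed that round.
    """
    completed = 0
    for needed in attempts_needed:
        if needed == 0 or needed > max_attempts:
            break  # fail-stop: later rounds are never attempted
        completed += 1
    return 100.0 * completed / len(attempts_needed)


def sr_mt_gap(sr_score: float, mt_score: float) -> float:
    """Gap in points between a single-round and a multi-turn score."""
    return round(sr_score - mt_score, 1)


# Toy task with five rounds: round 4 would have needed a fifth attempt.
toy_task = [1, 2, 1, 5, 1]
mt_at_4 = multi_turn_score(toy_task, max_attempts=4)
print(mt_at_4)                   # 60.0
print(sr_mt_gap(90.0, mt_at_4))  # 30.0 (90.0 is a made-up SR score)
```

Under this reading, round 5 contributes nothing even though the agent could have solved it in one attempt, which is why a fail-stop multi-turn score can sit far below the single-round score.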
The most instructive case: the agent with the **highest SR (78.9)** ranks only **third in MT@4 (44.0)**, a gap of 34.9 points. An agent that excels at isolated problems can still be mediocre when it must keep a codebase coherent over time. Rankings established by single-round benchmarks are therefore partially misleading as predictors of real-world utility in iterative development.
Another critical data point: **aggregate pass rate drops below 50% of round-1 performance by round 5**. Even the strongest agents see their performance halved in five turns. No agent exceeds ~50% success on multi-turn metrics.
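The "below 50% of round-1 performance" criterion is a relative measure, which a short sketch makes explicit. The helper names and the pass rates below are invented for illustration and are not taken from the benchmark's results.

```python
from typing import List, Optional


def relative_pass_rates(pass_rates: List[float]) -> List[float]:
    """Express each round's pass rate as a fraction of the round-1 rate."""
    baseline = pass_rates[0]
    if baseline <= 0:
        raise ValueError("round-1 pass rate must be positive")
    return [rate / baseline for rate in pass_rates]


def first_round_below(pass_rates: List[float], threshold: float = 0.5) -> Optional[int]:
    """Return the first 1-indexed round whose relative rate is under threshold."""
    for round_number, relative in enumerate(relative_pass_rates(pass_rates), start=1):
        if relative < threshold:
            return round_number
    return None


# Hypothetical aggregate pass rates for rounds 1..5 (not benchmark data).
toy_rates = [0.80, 0.70, 0.58, 0.46, 0.38]
halving_round = first_round_below(toy_rates)
print(halving_round)  # 5
```

Normalizing by the round-1 rate separates degradation over time from the initial difficulty of a task: an agent that starts at 80% and ends at 38% has lost more than half of its starting performance, even though 38% alone says nothing about the trend.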
### 3. Tier-dependent failure behavior
Failure analysis reveals clear stratification:

- **Weaker agents**: fail early, often in the first few turns, on basic tasks.
- **Stronger agents**: survive longer but expose qualitatively different failures: *specification tracking* (loss of context on prior requirements) and *regression failures* (modifications that break previously working code).

This distinction matters for agent engineering: the problems of strong agents stem less from raw code-generation capacity than from long-context management and state coherence. These are architectural issues (context window, external memory, codebase re-reading strategy) as much as model issues.
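With cumulative tests, regression failures can be separated mechanically from failures on newly introduced requirements. The sketch below is a rough proxy, not the authors' failure taxonomy: `classify_failures` and the test names are hypothetical. Specification-tracking failures in particular would likely need more than pass/fail data to identify.

```python
from typing import Dict, Set


def classify_failures(
    previously_passing: Set[str],
    new_checks: Set[str],
    current_results: Dict[str, bool],
) -> Dict[str, Set[str]]:
    """Split this round's failed checks into regressions and new-requirement failures."""
    failed = {name for name, passed in current_results.items() if not passed}
    return {
        # Passed in an earlier round, fails now: the agent broke working code.
        "regressions": failed & previously_passing,
        # Introduced this round and not yet satisfied.
        "new_requirement_failures": failed & new_checks,
    }


# Hypothetical round: one old check breaks, the new check is not met.
report = classify_failures(
    previously_passing={"test_login", "test_logout"},
    new_checks={"test_password_reset"},
    current_results={
        "test_login": False,
        "test_logout": True,
        "test_password_reset": False,
    },
)
print(report["regressions"])               # {'test_login'}
print(report["new_requirement_failures"])  # {'test_password_reset'}
```

Tracking the two categories separately per round gives an internal evaluation pipeline a direct view of whether an agent is failing to build new features or failing to preserve old ones.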
### 4. Harbor infrastructure and practical implications
The authors also release **Harbor**, the multi-turn infrastructure used to orchestrate evaluations. This may be as significant as the benchmark itself: Harbor enables replaying turn sequences with state preservation, opening the door to reproducible evaluations of agents in iterative scenarios.
**Direct losers from this publication:**

- Agents whose high SR scores masked multi-turn weaknesses: their positioning becomes contestable.
- Teams that optimized agents exclusively on HumanEval/MBPP: these benchmarks do not predict behavior in iterative contexts.
- Internal company evaluations based on single-round metrics to decide whether to deploy coding agents in production.

**What concretely changes:** before EvoCode-Bench, no standardized protocol existed to measure the progressive degradation of coding agents. Practitioners deploying agents on refactoring, maintenance, or feature-evolution tasks had no quantitative signal on reliability beyond the first turn. With 26 tasks, 227 rounds, and open-source infrastructure, the benchmark is concrete enough to integrate into internal evaluation pipelines, although 26 tasks remain a limited corpus for drawing definitive conclusions about specific domains.
Summary generated by Claude — human-verified