arXiv cs.AI

ContraFix: Agentic Vulnerability Repair via Differential Runtime Evidence and Skill Reuse

Signal: 82 · Hype: 25
In three lines: ContraFix is an agentic framework for automated vulnerability repair combining differential runtime evidence and skill reuse. On SEC-Bench (C/C++) and PatchEval (Go, Python, JavaScript), it achieves 84.0% and 73.8% resolution rates with GPT-4-mini, outperforming baselines while costing less than one-third of comparable approaches.

## ContraFix: Automated Vulnerability Repair via Differential Runtime Evidence

### 1. What's being announced

ContraFix is an agentic automated vulnerability repair (AVR) framework built on two distinct mechanisms: differential runtime evidence generation and cross-instance repair skill reuse. Evaluated on SEC-Bench (200 C/C++ instances) and PatchEval (225 instances across Go, Python, JavaScript), it achieves 84.0% and 73.8% resolution rates respectively using GPT-4-mini, at less than one-third the cost of the strongest comparable baseline.
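To make the percentages concrete, the short sketch below converts the reported resolution rates into approximate instance counts. The counts are inferred by rounding and are not stated in the summary, so treat them as an estimate; the helper name `resolved_count` is illustrative.

```python
# Benchmark sizes and resolution rates as reported in the summary above.
BENCHMARKS = {
    "SEC-Bench": (200, 84.0),
    "PatchEval": (225, 73.8),
}


def resolved_count(total: int, rate_percent: float) -> int:
    """Nearest whole number of instances consistent with a reported percentage."""
    return round(total * rate_percent / 100)


for name, (total, rate) in BENCHMARKS.items():
    n = resolved_count(total, rate)
    # Recompute the rate from the inferred count as a sanity check on the rounding.
    print(f"{name}: ~{n}/{total} resolved ({100 * n / total:.1f}%)")
```

Under this rounding, the two benchmarks come out at roughly 168 of 200 and 166 of 225 resolved instances.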

### 2. The actual problem it solves

Current LLM agents for AVR fail primarily through semantic misunderstanding: they produce symptom-oriented patches rather than causal fixes. A crash report tells you *where* the program failed, but not *which* variable or state transition, among many candidates near the fault site, separates crashing from safe execution. This is the local causal ambiguity problem.

ContraFix addresses it through three architectural steps:

- **Mutator**: constructs PoC variants that straddle the failure boundary: some trigger the crash, others don't. This straddling technique isolates the necessary and sufficient conditions for the vulnerability.
- **Analyzer**: inserts state probes around the fault region and synthesizes divergences between crashing and non-crashing executions into a structured *repair specification*.
- **Patcher**: converts that specification into verified source patches.
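The Mutator and Analyzer steps can be illustrated with a toy sketch. This is not the paper's implementation: the target is a hypothetical Python function in which an `IndexError` stands in for a memory-safety crash, the mutation strategy (truncating the PoC) is an assumption, and all names (`toy_target`, `mutate`, `analyze`) are invented for illustration. The Patcher step is omitted.

```python
from dataclasses import dataclass

BUF_SIZE = 8  # capacity of the toy fixed-size buffer


def toy_target(payload: bytes, probes: dict) -> None:
    """Stand-in for a vulnerable program; IndexError plays the role of a crash."""
    buf = [0] * BUF_SIZE
    # State probes recorded near the fault site (hypothetical choice of variables).
    probes["payload_len"] = len(payload)
    probes["first_byte"] = payload[0] if payload else -1
    for i, b in enumerate(payload):
        buf[i] = b  # missing bounds check: fails once i reaches BUF_SIZE


@dataclass
class Run:
    payload: bytes
    crashed: bool
    probes: dict


def mutate(poc: bytes) -> list[bytes]:
    """Mutator sketch: truncate the crashing PoC to get inputs on both sides of the boundary."""
    return [poc[:n] for n in range(len(poc) + 1)]


def execute(payload: bytes) -> Run:
    probes: dict = {}
    try:
        toy_target(payload, probes)
        return Run(payload, False, probes)
    except IndexError:
        return Run(payload, True, probes)


def analyze(runs: list[Run]) -> dict:
    """Analyzer sketch: keep probes whose values cleanly separate crashing from safe runs."""
    crash = [r for r in runs if r.crashed]
    safe = [r for r in runs if not r.crashed]
    spec = {}
    for key in crash[0].probes:
        crash_vals = [r.probes[key] for r in crash]
        safe_vals = [r.probes[key] for r in safe]
        if min(crash_vals) > max(safe_vals):
            spec[key] = {"safe_max": max(safe_vals), "crash_min": min(crash_vals)}
    return spec


runs = [execute(p) for p in mutate(b"A" * 12)]
print(analyze(runs))
```

In this toy case `first_byte` takes the same value in crashing and safe runs, so it drops out, while `payload_len` separates the two groups at the buffer capacity. That is the kind of divergence a repair specification would point a patcher toward, here a bounds check on the payload length.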

The second mechanism is a **two-track skill base**: each successful repair feeds a repository of repair specifications and mutation strategies, retrieved via a three-tier policy for future instances. This avoids re-diagnosing similar cases from scratch, a recurring cost that prior approaches ignored entirely.
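A minimal sketch of how such a skill base might be organized follows. The summary does not say what the three tiers are, so the tiers here (exact match on vulnerability class and language, then same class in any language, then a from-scratch fallback) are purely an assumption, as are the class and field names.

```python
from dataclasses import dataclass


@dataclass
class Skill:
    vuln_class: str  # e.g. "heap-buffer-overflow"
    language: str    # e.g. "c"
    content: str     # stored repair specification or mutation strategy


class SkillBase:
    # Two tracks, following the summary: repair specifications and mutation strategies.
    TRACKS = ("repair_specs", "mutation_strategies")

    def __init__(self) -> None:
        self.tracks: dict[str, list[Skill]] = {t: [] for t in self.TRACKS}

    def record(self, track: str, skill: Skill) -> None:
        """Store a skill extracted from a successful repair."""
        self.tracks[track].append(skill)

    def retrieve(self, track: str, vuln_class: str, language: str) -> tuple[int, list[Skill]]:
        """Return (tier, skills). The tier definitions are assumptions, not the paper's."""
        skills = self.tracks[track]
        exact = [s for s in skills if s.vuln_class == vuln_class and s.language == language]
        if exact:
            return 1, exact
        same_class = [s for s in skills if s.vuln_class == vuln_class]
        if same_class:
            return 2, same_class
        return 3, []  # nothing reusable: fall back to from-scratch diagnosis


base = SkillBase()
base.record("repair_specs", Skill("heap-buffer-overflow", "c", "check length against capacity before copy"))
print(base.retrieve("repair_specs", "heap-buffer-overflow", "go"))
```

The lookup for a Go instance falls through to tier 2 and returns the C skill, which shows how a stored specification could be reused across languages when no exact match exists.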

### 3. Why the numbers matter

The benchmarks are not trivial. SEC-Bench covers real C/C++ with memory-class vulnerabilities (buffer overflow, use-after-free) that have historically resisted automated repair. PatchEval spans three semantically distinct languages (Go, Python, JavaScript), testing cross-language generalization.

The cost-to-performance ratio is the most significant figure: less than one-third the cost of the strongest baseline at higher resolution rates. Using GPT-4-mini rather than GPT-4 or GPT-4o strongly suggests the architecture compensates for raw model capability. This is a direct signal that the bottleneck in current AVR is causal reasoning, not code generation capacity.

Before ContraFix, the best published AVR systems on comparable multi-language benchmarks plateaued around 60-70%, and required more expensive models. The jump to 84.0% on SEC-Bench is substantial on a C/C++ benchmark where memory errors demand precise understanding of pointer lifecycle semantics.

### 4. Potential losers and open limits

**Direct losers**: AVR approaches based on direct patch generation without differential execution (CodeAct-style agents, SWE-agent variants without runtime instrumentation) lose positioning. Commercial SAST/DAST tools offering automated fix suggestions — Snyk, Semgrep Autofix, GitHub Copilot Autofix — face exposure if this architecture pattern becomes standard.

**Unresolved limits**: the dependency on differential execution assumes PoCs are available or constructible — which is not always true for logic vulnerabilities or authentication flaws without observable crashes. The skill base accrues value with volume, creating an advantage for high-throughput deployers and a disadvantage for one-off use cases. The evaluation also remains on academic benchmarks; performance on recent CVEs in complex real-world repositories is undemonstrated.

The ContraFix architecture is reproducible with any tool-calling-capable LLM. The deliberate choice of GPT-4-mini as the reference model is conservative and strengthens result credibility — the gains come from the framework design, not from model scale.


Summary generated by Claude — human-verified