ContraFix: Agentic Vulnerability Repair via Differential Runtime Evidence and Skill Reuse
In three lines: ContraFix is an agentic framework for automated vulnerability repair combining differential runtime evidence and skill reuse. On SEC-Bench (C/C++) and PatchEval (Go, Python, JavaScript), it achieves 84.0% and 73.8% resolution rates with GPT-4-mini, outperforming baselines while costing less than one-third of comparable approaches.
## ContraFix: Automated Vulnerability Repair via Differential Runtime Evidence
### 1. What's being announced
ContraFix is an agentic automated vulnerability repair (AVR) framework built on two distinct mechanisms: differential runtime evidence generation and cross-instance repair skill reuse. Evaluated on SEC-Bench (200 C/C++ instances) and PatchEval (225 instances across Go, Python, JavaScript), it achieves 84.0% and 73.8% resolution rates respectively using GPT-4-mini, at less than one-third the cost of the strongest comparable baseline.
### 2. The actual problem it solves
Current LLM agents for AVR fail primarily through semantic misunderstanding: they produce symptom-oriented patches rather than causal fixes. A crash report tells you *where* the program failed, but not *which* variable or state transition, among many candidates near the fault site, separates crashing from safe execution. This is the local causal ambiguity problem.
ContraFix addresses it through three architectural steps:
- **Mutator**: constructs PoC variants that straddle the failure boundary (some trigger the crash, others don't). This straddling technique isolates the necessary and sufficient conditions for the vulnerability.
- **Analyzer**: inserts state probes around the fault region and synthesizes divergences between crashing and non-crashing executions into a structured *repair specification*.
- **Patcher**: converts that specification into verified source patches.
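
The differential idea behind the Mutator and Analyzer steps can be illustrated with a minimal Python sketch. It is not ContraFix's implementation: the `Run` record, the probe names (`len`, `buf_size`, `flags`), the sample values, and the "value sets must be disjoint" criterion are all assumptions made for illustration. The sketch only shows how comparing probe readings across crashing and non-crashing PoC variants can narrow down which state actually separates the two groups.

```python
from dataclasses import dataclass


@dataclass
class Run:
    """One execution of a PoC variant (hypothetical record, not ContraFix's format)."""
    name: str
    crashed: bool
    probes: dict  # probe name -> integer value observed near the fault site


def separating_probes(runs):
    """Return probes whose observed values never overlap between crashing and safe runs."""
    crashing = [r for r in runs if r.crashed]
    safe = [r for r in runs if not r.crashed]
    if not crashing or not safe:
        raise ValueError("need variants on both sides of the failure boundary")

    # Only compare probes that were recorded in every run.
    shared = set.intersection(*(set(r.probes) for r in runs))

    evidence = {}
    for probe in sorted(shared):
        crash_vals = {r.probes[probe] for r in crashing}
        safe_vals = {r.probes[probe] for r in safe}
        if crash_vals.isdisjoint(safe_vals):
            evidence[probe] = {
                "crashing": sorted(crash_vals),
                "safe": sorted(safe_vals),
            }
    return evidence


# Made-up probe readings for four PoC variants around a 256-byte buffer.
runs = [
    Run("poc_original", True, {"len": 300, "buf_size": 256, "flags": 1}),
    Run("poc_longer", True, {"len": 512, "buf_size": 256, "flags": 0}),
    Run("poc_short", False, {"len": 128, "buf_size": 256, "flags": 1}),
    Run("poc_boundary", False, {"len": 256, "buf_size": 256, "flags": 0}),
]

evidence = separating_probes(runs)
print(evidence)
```

In this toy data, `buf_size` is constant and `flags` takes the same values on both sides, so only `len` survives as a candidate. A crash report alone would list all three variables near the fault site; the contrast between runs is what singles out the length check as the likely repair target.
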
The second mechanism is a **two-track skill base**: each successful repair feeds a repository of repair specifications and mutation strategies, retrieved via a three-tier policy for future instances. This reduces from-scratch diagnosis on similar cases, a recurring cost that prior approaches largely left unaddressed.
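
A rough Python sketch of what a two-track store with tiered retrieval might look like follows. The summary does not say what the three tiers are, so the tiers used here are purely an assumption: (1) same language and vulnerability class, (2) same vulnerability class in any language, (3) no match, meaning the agent falls back to from-scratch diagnosis. The class names, fields, and example strings are likewise hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Skill:
    """A stored artifact from a past successful repair (hypothetical schema)."""
    language: str
    vuln_class: str
    payload: str


@dataclass
class SkillBase:
    # Track 1: repair specifications. Track 2: mutation strategies.
    repair_specs: list = field(default_factory=list)
    mutation_strategies: list = field(default_factory=list)

    def record(self, language, vuln_class, repair_spec, mutation_strategy):
        """Store both artifacts from one successful repair."""
        self.repair_specs.append(Skill(language, vuln_class, repair_spec))
        self.mutation_strategies.append(Skill(language, vuln_class, mutation_strategy))

    @staticmethod
    def _lookup(track, language, vuln_class):
        """Return (tier, matches) under the assumed three-tier policy."""
        # Tier 1 (assumed): same language and same vulnerability class.
        exact = [s for s in track
                 if s.language == language and s.vuln_class == vuln_class]
        if exact:
            return 1, exact
        # Tier 2 (assumed): same vulnerability class, any language.
        same_class = [s for s in track if s.vuln_class == vuln_class]
        if same_class:
            return 2, same_class
        # Tier 3 (assumed): nothing reusable, diagnose from scratch.
        return 3, []

    def retrieve(self, language, vuln_class):
        """Query both tracks independently for a new instance."""
        return {
            "repair_specs": self._lookup(self.repair_specs, language, vuln_class),
            "mutation_strategies": self._lookup(
                self.mutation_strategies, language, vuln_class),
        }


base = SkillBase()
base.record(
    "c", "heap-overflow",
    "bound len by buf_size before the copy",
    "vary input length around the buffer size",
)

same_lang = base.retrieve("c", "heap-overflow")
cross_lang = base.retrieve("cpp", "heap-overflow")
unseen = base.retrieve("go", "sql-injection")
print(same_lang["repair_specs"][0], cross_lang["repair_specs"][0], unseen["repair_specs"][0])
```

Keeping the two tracks separate means a past mutation strategy can be reused even when no matching repair specification exists, and vice versa. A real system would likely rank matches by similarity rather than by exact label equality.
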
### 3. Why the numbers matter
The benchmarks are not trivial. SEC-Bench covers real C/C++ with memory-class vulnerabilities (buffer overflow, use-after-free) that have historically resisted automated repair. PatchEval spans three semantically distinct languages (Go, Python, JavaScript), testing cross-language generalization.
The cost-to-performance ratio is the most significant figure: less than one-third the cost of the strongest baseline at higher resolution rates. Using GPT-4-mini rather than GPT-4 or GPT-4o suggests the architecture compensates for raw model capability, which in turn points to causal reasoning, not code generation capacity, as the bottleneck in current AVR.
Before ContraFix, the best published AVR systems on comparable multi-language benchmarks plateaued around 60-70%, and required more expensive models. The jump to 84.0% on SEC-Bench is substantial on a C/C++ benchmark where memory errors demand precise understanding of pointer lifecycle semantics.
### 4. Potential losers and open limits
**Direct losers**: AVR approaches based on direct patch generation without differential execution (CodeAct-style agents, SWE-agent variants without runtime instrumentation) lose positioning. Commercial SAST/DAST tools offering automated fix suggestions — Snyk, Semgrep Autofix, GitHub Copilot Autofix — face exposure if this architecture pattern becomes standard.
**Unresolved limits**: the dependency on differential execution assumes PoCs are available or constructible — which is not always true for logic vulnerabilities or authentication flaws without observable crashes. The skill base accrues value with volume, creating an advantage for high-throughput deployers and a disadvantage for one-off use cases. The evaluation also remains on academic benchmarks; performance on recent CVEs in complex real-world repositories is undemonstrated.
The ContraFix architecture is reproducible with any tool-calling-capable LLM. The deliberate choice of GPT-4-mini as the reference model is conservative and strengthens result credibility — the gains come from the framework design, not from model scale.
Summary generated by Claude — human-verified