LLM agents patch security bugs, pass all tests, but still leave the vulnerability open [R]
Signal: 78
Hype: 15
In three lines: CVE-Bench evaluates 5 frontier models on 20 real-world CVEs (Pillow, GitPython, urllib3, etc.) across 300 runs. The maximum solve rate is 50% (60% with the advisory). Agents produce syntactically valid patches that pass the tests but leave the vulnerability open. Cross-family gaps are statistically significant (OpenAI vs. Laguna, p<0.05), while within-family differences are noise. Failure modes: wrong-search drift, hallucinations, context loss.
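To make "passes the tests but leaves the vulnerability open" concrete, here is a minimal sketch of that failure pattern on a toy path-traversal bug. It is not taken from CVE-Bench: the functions `naive_patch`, `proper_patch`, `functional_test`, and `exploit_check`, the base directory, and the payloads are all hypothetical, chosen only for illustration.

```python
import posixpath

BASE = "/srv/files"  # hypothetical base directory for the toy example


def naive_patch(base: str, user_path: str) -> str:
    """Hypothetical 'fix' that strips literal '../' sequences in a single pass."""
    cleaned = user_path.replace("../", "")
    return posixpath.normpath(posixpath.join(base, cleaned))


def proper_patch(base: str, user_path: str) -> str:
    """Resolve first, then reject anything that lands outside the base directory."""
    resolved = posixpath.normpath(posixpath.join(base, user_path))
    if resolved != base and not resolved.startswith(base + "/"):
        raise ValueError("path escapes base directory")
    return resolved


def functional_test(resolve) -> bool:
    """The kind of regression test an agent can satisfy without fixing the bug."""
    return resolve(BASE, "docs/readme.txt") == BASE + "/docs/readme.txt"


def exploit_check(resolve) -> bool:
    """Return True only if every payload is rejected or stays under BASE."""
    payloads = ["../../etc/passwd", "....//....//etc/passwd"]
    for payload in payloads:
        try:
            out = resolve(BASE, payload)
        except ValueError:
            continue  # rejected outright, which counts as safe
        if not out.startswith(BASE + "/"):
            return False  # payload escaped the base directory
    return True


if __name__ == "__main__":
    for name, fn in [("naive_patch", naive_patch), ("proper_patch", proper_patch)]:
        print(name, "tests pass:", functional_test(fn), "exploit blocked:", exploit_check(fn))
```

Under these assumptions, the single-pass filter satisfies the functional test, yet the nested payload `....//....//etc/passwd` collapses back into `../../etc/passwd` after filtering. A test suite without an exploit-level check cannot tell the two patches apart.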
Summary generated by Claude — human-verified