Reddit r/MachineLearning·2 June 2026

LLM agents patch security bugs, pass all tests, but still leave the vulnerability open [R]

Signal

Hype

In three linesCVE-Bench evaluates 5 frontier models on 20 real-world CVEs (Pillow, GitPython, urllib3, etc.) across 300 runs. Max solve rate 50% (60% under advisory). Agents patch syntactically but leave vulnerabilities open. Significant cross-family gaps (OpenAI vs Laguna, p<0.05), within-family noise. Failure modes: wrong-search drift, hallucinations, context loss.

Read source

Your take?

AI Agents Benchmarks AI safety Evals Code generation

Summary generated by Claude — human-verified

LLM agents patch security bugs, pass all tests, but still leave the vulnerability open [R]

Other angles on this story