Why we no longer evaluate SWE-bench Verified
Signal: 75
Hype: 25
In three lines: OpenAI stops evaluating on SWE-bench Verified, citing contamination and poor measurement of frontier coding progress. Analysis reveals flawed tests and training leakage. OpenAI recommends SWE-bench Pro instead.
Summary generated by Claude — human-verified