Why SWE-bench Verified no longer measures frontier coding capabilities
Summary
The authors have stopped reporting SWE-bench Verified scores because the benchmark is no longer a reliable measure of frontier model progress in autonomous software engineering. Their analysis revealed two major issues. First, at least 59.4% of audited problems have flawed test cases, either too narrow or too broad, that reject functionally correct solutions. Second, frontier models show evidence of training data contamination: they can reproduce the original human-written 'gold patch' or verbatim problem details. This contamination suggests that score improvements increasingly reflect exposure to the benchmark during training rather than genuine real-world software development ability. Consequently, the authors recommend using SWE-bench Pro instead and are investing in new, uncontaminated evaluations such as GDPVal.
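To make the "too narrow" failure mode concrete, here is a minimal hypothetical sketch (not drawn from the SWE-bench dataset; all function names and messages are invented for illustration). The test pins an exact error-message string rather than the behavior it should check, so an alternative fix that is functionally correct still fails:

```python
# Hypothetical illustration of a "too narrow" test case.

def gold_patch_divide(a, b):
    # The original author's fix: raise with one specific message.
    if b == 0:
        raise ValueError("denominator must be nonzero")
    return a / b

def alternative_fix_divide(a, b):
    # An equally correct fix that words the message differently.
    if b == 0:
        raise ValueError("cannot divide by zero")
    return a / b

def too_narrow_test(divide):
    # Flawed test: asserts the exact message string instead of the
    # relevant behavior (that a ValueError is raised on b == 0).
    try:
        divide(1, 0)
    except ValueError as e:
        return str(e) == "denominator must be nonzero"
    return False

print(too_narrow_test(gold_patch_divide))       # True: gold patch passes
print(too_narrow_test(alternative_fix_divide))  # False: correct fix rejected
```

A test written against the behavior (asserting only that `ValueError` is raised) would accept both patches; tests that over-specify incidental details are one way a benchmark can undercount correct model solutions.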
(Source: OpenAI)