Open-source benchmark EVMbench tests how well AI agents handle smart contract exploits - Help Net Security

EVMbench is a new open-source benchmark by OpenAI and Paradigm to test AI agents on detecting, patching, and exploiting smart contract vulnerabilities.

Summary

EVMbench is a new open-source benchmark, developed by OpenAI and Paradigm, designed to rigorously test AI agents on practical security tasks related to Ethereum Virtual Machine (EVM) smart contracts, which frequently control significant assets.

The benchmark focuses on three core tasks: detecting known vulnerabilities in audited code, patching vulnerable code while maintaining functionality, and successfully executing exploits in a controlled, sandboxed environment.

EVMbench uses a dataset of 120 curated vulnerabilities drawn from real audits and competitions. Tasks run in containerized environments and are graded by automated, deterministic scoring so that results are repeatable. Initial results show uneven performance: exploit tasks remain difficult for many systems, although exploit success rates have recently improved significantly, while patching remains a major weakness for current models.
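
To illustrate what automated, deterministic scoring across the three task types could look like, here is a minimal Python sketch. It is not EVMbench's actual harness or API. The `TaskResult` record, the mode names, the `success_rates` helper, and the sample data are all hypothetical and exist only for illustration. The sketch assumes each task ends in a plain pass/fail verdict, so the same inputs always produce the same per-mode scores.

```python
from collections import defaultdict
from dataclasses import dataclass

# The three task types named in the article; the string labels are assumptions.
MODES = ("detect", "patch", "exploit")


@dataclass(frozen=True)
class TaskResult:
    vuln_id: str   # hypothetical identifier for one curated vulnerability
    mode: str      # one of MODES
    passed: bool   # pass/fail verdict from an automated grader


def success_rates(results):
    """Return the fraction of passed tasks per mode, in fixed mode order."""
    totals = defaultdict(int)
    passes = defaultdict(int)
    for r in results:
        if r.mode not in MODES:
            raise ValueError(f"unknown mode: {r.mode}")
        totals[r.mode] += 1
        passes[r.mode] += int(r.passed)
    # Modes with no attempted tasks are omitted rather than reported as 0.
    return {m: passes[m] / totals[m] for m in MODES if totals[m]}


# Made-up sample verdicts, not real benchmark results.
results = [
    TaskResult("vuln-001", "detect", True),
    TaskResult("vuln-001", "patch", False),
    TaskResult("vuln-001", "exploit", True),
    TaskResult("vuln-002", "detect", True),
    TaskResult("vuln-002", "patch", True),
    TaskResult("vuln-002", "exploit", False),
]

print(success_rates(results))
```

Reporting one rate per mode, rather than a single combined number, keeps an uneven profile visible, such as the gap between exploit and patch performance noted in the summary.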

(Source: Help Net Security)