Smart contract bugs have caused some of the biggest losses in the crypto industry. In many cases, a small mistake in the code was all it took for attackers to step in and drain millions of dollars from decentralized platforms.
As more money continues to move onto blockchains, the stakes keep rising. There is simply less room for error. At the same time, AI tools are playing a bigger role in writing and reviewing code, which makes it even more important to understand how reliable these systems actually are before trusting them with high-value financial infrastructure.
In response to that urgency, OpenAI has partnered with Paradigm to launch EVMbench, an evaluation system built to measure how effectively AI models can detect, patch, and even exploit vulnerabilities in smart contracts. The goal is not marketing, but rigorous testing of real-world performance in high-stakes financial code.
Introducing EVMbench—a new benchmark that measures how well AI agents can detect, exploit, and patch high-severity smart contract vulnerabilities. https://t.co/op5zufgAGH
— OpenAI (@OpenAI) February 18, 2026
Bringing the Technical Details Into Focus
At its foundation, EVMbench focuses on the Ethereum Virtual Machine, often called the EVM. This is the system that runs smart contracts on Ethereum and many other compatible blockchains. It is the engine that makes decentralized applications work.
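As a small illustration of what running on the EVM means in practice, the sketch below uses the widely used web3.py library to read basic on-chain state; the RPC endpoint and address are placeholders, and none of this is EVMbench code.

# Minimal sketch of reading EVM state with web3.py (pip install web3).
# The endpoint below is a placeholder; substitute a real Ethereum node URL.
from web3 import Web3

w3 = Web3(Web3.HTTPProvider("https://example-rpc.invalid"))

# Every EVM account, including smart contracts, has a balance and code.
address = Web3.to_checksum_address("0x" + "00" * 20)  # placeholder address
balance_wei = w3.eth.get_balance(address)
code = w3.eth.get_code(address)  # non-empty bytes means a deployed contract

print(f"balance: {Web3.from_wei(balance_wei, 'ether')} ETH")
print(f"is contract: {len(code) > 0}")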
Smart contracts operating in this environment are responsible for securing well over 100 billion dollars in digital assets. That scale is exactly why testing AI systems against real smart contract risks matters so much.
As the company noted in its announcement, “As AI agents improve at reading, writing, and executing code, it becomes increasingly important to measure their capabilities in economically meaningful environments, and to encourage the use of AI systems defensively to audit and strengthen deployed contracts.” The benchmark is meant to provide that understanding.
What makes EVMbench different is that it is built on real audit findings, not made-up examples. The dataset includes 120 serious vulnerabilities taken from 40 past smart contract audits, including public audit competitions. It also draws from reviews of the Tempo blockchain, a payments network designed for stablecoin transactions. Because the issues are based on real cases that once affected live systems, the benchmark reflects problems that have actually cost projects money.
The benchmark tests AI systems in three clear ways. In detection mode, models review contract code and try to spot known weaknesses. In patch mode, they must fix those weaknesses without damaging the rest of the code. In exploit mode, the models attempt to simulate an attack, such as draining funds from a vulnerable contract, inside a closed testing setup. These attack scripts are based on previously disclosed cases, and none of the tests touch live networks.
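To make the exploit-mode idea concrete, here is a minimal, self-contained Python sketch of the kind of attack such a harness might simulate: the classic reentrancy pattern, in which a contract pays out before updating its ledger and a malicious callback re-enters the withdrawal until the pot is empty. This is a toy model with hypothetical names, not code from the benchmark.

# Toy simulation of a reentrancy exploit, the bug class behind several
# of the industry's largest losses. Illustrative Python only; all names
# here are hypothetical and nothing below is EVMbench code.

class VulnerableVault:
    """Models a contract that pays out BEFORE updating its ledger."""

    def __init__(self, deposits):
        self.balances = dict(deposits)      # depositor -> credited amount
        self.pot = sum(deposits.values())   # total funds actually held

    def withdraw(self, caller):
        amount = self.balances.get(caller.address, 0)
        if amount == 0 or self.pot < amount:
            return
        self.pot -= amount
        # Bug: the external call runs before the balance is zeroed,
        # so the callee can re-enter withdraw() with a stale balance.
        caller.receive(self, amount)
        self.balances[caller.address] = 0   # too late

class Attacker:
    """Drains the vault by re-entering withdraw() from its callback."""

    def __init__(self, address):
        self.address = address
        self.stolen = 0

    def receive(self, vault, amount):
        self.stolen += amount
        # Re-enter while the vault still shows a stale, nonzero balance.
        if vault.pot >= vault.balances.get(self.address, 0) > 0:
            vault.withdraw(self)

vault = VulnerableVault({"attacker": 10, "honest_user": 90})
mallory = Attacker("attacker")
vault.withdraw(mallory)
print(f"deposited 10, drained {mallory.stolen}")  # deposited 10, drained 100

The textbook fix, and the kind of change patch mode would reward, is to zero the caller's balance before making the external call, so a re-entrant call finds nothing left to withdraw.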
Early results show fast improvement in some areas, but clear gaps in others. In exploit mode, newer coding-focused models performed far better than earlier versions. One advanced model scored more than twice as high as a predecessor released only months before. This suggests AI systems are getting better at completing structured technical tasks in controlled environments.
However, the mixed performance also shows why careful testing is important. While models are improving at carrying out simulated attacks, they are still less reliable at identifying vulnerabilities and fixing them correctly in complex financial code.
By launching EVMbench, OpenAI and Paradigm are not claiming that AI can replace human auditors. Instead, they are emphasizing the need to measure and understand these systems before relying on them in high-stakes environments. In a space where billions of dollars depend on secure smart contracts, knowing how well AI performs may be just as important as making the models more powerful.
