As artificial intelligence (AI) continues to evolve, its potential to replace human workers has become a topic of concern. Smart contract auditing is one area where that question is being tested.
A recent experiment conducted by OpenZeppelin, a leading blockchain security company, sought to explore this possibility by pitting ChatGPT-4, an AI model developed by OpenAI, against 28 Ethernaut challenges designed to identify smart contract vulnerabilities.
Ethernaut challenges are a series of puzzles designed to test and improve a user’s understanding of Ethereum (ETH) smart contract vulnerabilities.
Created by OpenZeppelin, these challenges are part of a game-like platform called Ethernaut. Each challenge presents a unique smart contract with a specific vulnerability that players must identify and exploit to solve the challenge.
The levels range in difficulty and cover a variety of common vulnerabilities found in smart contracts, such as re-entrancy attacks and integer underflows and overflows. By working through these challenges, players can gain a deeper understanding of smart contract security and Ethereum development.
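To make one of these vulnerability classes concrete, here is a minimal Python sketch of an integer underflow. It is an illustration only, not code from any Ethernaut level: the function names are hypothetical, and it assumes the behavior of Solidity versions before 0.8.0, where `uint256` arithmetic wraps around silently instead of reverting.

```python
# Illustrative model of uint256 arithmetic; names here are hypothetical.
UINT256_MAX = 2**256 - 1


def unchecked_sub(a: int, b: int) -> int:
    """Subtract with silent wrap-around modulo 2**256 (pre-0.8.0 Solidity style)."""
    return (a - b) % 2**256


def unchecked_add(a: int, b: int) -> int:
    """Add with silent wrap-around modulo 2**256."""
    return (a + b) % 2**256


def checked_sub(a: int, b: int) -> int:
    """Subtract, but fail on underflow, as a safe-math check would."""
    if b > a:
        raise ArithmeticError("underflow: cannot subtract more than the balance")
    return a - b


balance = 20
# Spending 21 tokens from a balance of 20 wraps to the largest possible value.
wrapped = unchecked_sub(balance, 21)
print(wrapped == UINT256_MAX)  # True
```

A contract that only checks `balance - amount >= 0` on an unsigned value never fails that check, which is why the checked variant compares the operands before subtracting.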
The experiment involved presenting the AI with the code for a given Ethernaut level and asking it to identify any vulnerabilities. GPT-4 was able to solve 19 out of the 23 challenges that were introduced before its training data cutoff date of September 2021.
However, it fared poorly on Ethernaut’s newest levels, failing 4 out of 5. This suggests that while AI can be a useful tool for identifying some security vulnerabilities, it cannot replace the need for a human auditor.
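The gap between the two groups of levels is easier to see as success rates. The short Python sketch below only restates the figures reported above (19 of 23 older levels solved, 4 of 5 newer levels failed); the helper name is hypothetical.

```python
def solve_rate(solved: int, total: int) -> float:
    """Return the fraction of levels solved."""
    return solved / total


older = solve_rate(19, 23)        # levels released before the training cutoff
newer = solve_rate(5 - 4, 5)      # failing 4 of 5 means 1 was solved
overall = solve_rate(19 + 1, 28)  # all 28 levels combined

print(f"older levels: {older:.1%}")    # older levels: 82.6%
print(f"newer levels: {newer:.1%}")    # newer levels: 20.0%
print(f"overall:      {overall:.1%}")  # overall:      71.4%
```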
One likely factor in ChatGPT’s success with levels 1-23 is that GPT-4’s training data may have contained several solution write-ups for these levels.
Levels 24-28 were released after the September 2021 cutoff for GPT-4’s training data, so its failure on most of them supports the explanation that published solutions in the training data account for much of its earlier success.
The AI’s performance was also influenced by its “temperature” setting, which affects the randomness of its responses. With values closer to 2, ChatGPT generates more varied, creative responses, while values closer to 0 yield more focused and deterministic answers.
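The effect of temperature can be sketched with a temperature-scaled softmax, the usual way a language model turns raw scores into sampling probabilities. This is a simplified illustration, not OpenAI's implementation: the scores are made up, and the function name is hypothetical. It also requires a temperature above 0, whereas a hosted API may treat 0 as "always pick the top choice".

```python
import math


def softmax_with_temperature(logits: list[float], temperature: float) -> list[float]:
    """Convert raw scores to probabilities, dividing by temperature first."""
    if temperature <= 0:
        raise ValueError("temperature must be positive in this sketch")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Made-up scores for three candidate next tokens.
logits = [2.0, 1.0, 0.1]

for t in (0.2, 1.0, 2.0):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
```

At a low temperature nearly all of the probability lands on the highest-scoring candidate, so answers are close to deterministic. At a high temperature the distribution flattens, so less likely candidates are sampled more often.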
Despite its successes, GPT-4 struggled with certain challenges, often requiring specific follow-up questions to home in on the vulnerability.
In some cases, even with strong guidance, the AI failed to produce a correct strategy. The results suggest that AI tools can increase audit efficiency mainly when the auditor knows specifically what to look for and how to prompt large language models like ChatGPT effectively.
However, the experiment also revealed that in-depth security knowledge is necessary to assess whether the answer provided by AI is accurate or nonsensical. For example, in Level 24, “PuzzleWallet,” GPT-4 invented a vulnerability related to multicall and falsely claimed that it was not possible for an attacker to become the owner of the wallet.
While the experiment demonstrated that smart contract analysis performed by GPT-4 cannot replace a human security audit, it did show that AI can be a useful tool for finding some security vulnerabilities.
Given the rapid pace of innovation in blockchain and smart contract development, it’s crucial for human auditors to stay up to date on the latest attack vectors and innovations across Web3.
OpenZeppelin’s growing AI team is currently experimenting with OpenAI as well as custom machine learning solutions to improve smart contract vulnerability detection. The goal is to help OpenZeppelin auditors improve coverage and complete audits more efficiently.