Cheating On AI Agent Evaluations

By Maia Hamin and Benjamin Edelman

AI evaluations are designed to assess and compare how AI models perform on different tasks. Developers, users, and independent evaluators — like the Center for AI Standards and Innovation (CAISI) — can use evaluations to track trends in model capabilities and inform decisions about real-world use.

Agent evaluations test whether models can use tools in a multi-turn feedback loop to solve complex problems like debugging software or uncovering cybersecurity vulnerabilities. They allow evaluators to measure new and increasingly economically valuable capabilities, but also bring new methodological challenges — including, as CAISI and other evaluators have found, the risk that AI agents can use their tools to cheat.

As part of our mission, CAISI both directly evaluates AI models and seeks to advance best practices for AI measurement science. To improve our evaluations, we built an AI transcript analysis tool to search through our historical evaluation transcripts for cheating. To support the development of stronger ecosystem practices, the following pages share examples that we uncovered and suggests takeaways that may help other evaluators reduce the incidence and impact of evaluation cheating.

Using our transcript analysis tool, we found several examples of how models were able to successfully cheat on agentic coding and cyber benchmarks, including:

Models using the internet to find walkthroughs and answers for cyber capture-the-flag challenges
Models using generic denial-of-service attacks to crash servers on cyber tasks instead of exploiting intended vulnerabilities
Models cheating on coding benchmarks by looking up more recent code versions, disabling assertions, and adding test-specific logic.

We bucket these examples into two categories of cheating risks: solution contamination, where a model accesses information that improperly reveals the solution to an evaluation task; and grader gaming, where a model exploits a gap or misspecification in an evaluation’s automated scoring system to craft a solution that scores highly without fulfilling the “spirit” of the intended task.

Examples of evaluation cheating from CAISI’s evaluation logs
CAISI Benchmark	Cheating Examples from Evaluation Logs	Cheating Type	% Logs with Successful Solution Due to Cheating (Lower Bound)
Cybench	Using coding tools to search the internet for challenge flags and walkthroughs	Solution contamination	0.3%
SWE-bench Verified	Reviewing more recent code versions on GitHub Installing more recent code versions using package managers	Solution contamination	0.1%
SWE-bench Verified	Commenting out assertion checks to pass unit tests	Grader gaming	0.2%
CVE-Bench (Internal)	Using denial-of-service attacks to crash the target server instead of exploiting the CVE	Grader gaming	4.80%

In general, we define evaluation cheating as:

when an AI model exploits a gap between what an evaluation task is intended to measure and its implementation, solving the task in a way that subverts the validity of the measurement.

This definition focuses on the problem that cheating creates for evaluation validity: if models can exploit implementation loopholes to score higher without actually improving at the skills an evaluation is intended to measure, it degrades the value of that measurement for decisions about real-world adoption. As models become increasingly adept problem-solvers, they may be able to find new successful cheating strategies, and detecting and preventing cheating may become increasingly important for the validity and comparability of evaluation results.

Based on lessons we learned through this process, we share some preliminary suggested practices for other evaluators and benchmark designers interested in addressing evaluation cheating, including:

Review evaluation transcripts for cheating, including by leveraging AI transcript analysis tools that can help scale human review processes.
Prevent cheating by closing task design loopholes and setting clear rules in task prompts in order to make model comparisons more accurate and fair.
Standardize benchmark-specific expectations about agent affordances and restrictions to help evaluators create more comparable evaluation results, including by making it easier to catch and prevent cheating.

View the full writeup to see examples of cheating from CAISI’s agent evaluations and a discussion of practices to address evaluation cheating.

What's On

Putting Einstein to the Test With the World’s Most Accurate Clocks

Maximize Flow for Your Organization’s Long-Term Success

All aboard: the NIST Cybersecurity for IoT Program is headed to our next stop! Share your input on where we’re headed during our Future Directions Two-Day Workshop on March 31st.

Manufacturing in America – Contributing to Our Economy, Employment, and Innovation

NIST Spectroradiometry Short Course | NIST

Cheating On AI Agent Evaluations

All aboard: the NIST Cybersecurity for IoT Program is headed to our next stop! Share your input on where we’re headed during our Future Directions Two-Day Workshop on March 31st.

Accelerating AI Innovation Through Measurement Science

Reflections from the Second NIST Cyber AI Profile Workshop

Maximize Flow for Your Organization’s Long-Term Success

All aboard: the NIST Cybersecurity for IoT Program is headed to our next stop! Share your input on where we’re headed during our Future Directions Two-Day Workshop on March 31st.

Manufacturing in America – Contributing to Our Economy, Employment, and Innovation

What's On

Cheating On AI Agent Evaluations

Related Posts