Close Menu
ManufacturingManufacturing
  • Home
  • Automation
  • Industrial Data & AI
  • Innovation
  • Leadership
  • Sustainability
  • More
    • Digital Transformation
    • Web Stories
    • Press Release
    • Spotlight
What's On
Putting Einstein to the Test With the World’s Most Accurate Clocks

Putting Einstein to the Test With the World’s Most Accurate Clocks

16 April 2026
Maximize Flow for Your Organization’s Long-Term Success

Maximize Flow for Your Organization’s Long-Term Success

16 April 2026

All aboard: the NIST Cybersecurity for IoT Program is headed to our next stop! Share your input on where we’re headed during our Future Directions Two-Day Workshop on March 31st.

16 April 2026
Manufacturing in America – Contributing to Our Economy, Employment, and Innovation

Manufacturing in America – Contributing to Our Economy, Employment, and Innovation

16 April 2026
NIST Spectroradiometry Short Course | NIST

NIST Spectroradiometry Short Course | NIST

15 April 2026
Facebook X (Twitter) Instagram
Facebook X (Twitter) Instagram
ManufacturingManufacturing
Subscribe
  • Home
  • Automation
  • Industrial Data & AI
  • Innovation
  • Leadership
  • Sustainability
  • More
    • Digital Transformation
    • Web Stories
    • Press Release
    • Spotlight
ManufacturingManufacturing
Home » Cheating On AI Agent Evaluations
Industrial Data & AI

Cheating On AI Agent Evaluations

manufacturing.com.deBy manufacturing.com.de15 April 2026No Comments4 Mins Read
Facebook Twitter LinkedIn Telegram Pinterest Tumblr Reddit WhatsApp Email
Cheating On AI Agent Evaluations
Share
Facebook Twitter LinkedIn Pinterest Email

By Maia Hamin and Benjamin Edelman

AI evaluations are designed to assess and compare how AI models perform on different tasks. Developers, users, and independent evaluators — like the Center for AI Standards and Innovation (CAISI) — can use evaluations to track trends in model capabilities and inform decisions about real-world use.

Agent evaluations test whether models can use tools in a multi-turn feedback loop to solve complex problems like debugging software or uncovering cybersecurity vulnerabilities. They allow evaluators to measure new and increasingly economically valuable capabilities, but also bring new methodological challenges — including, as CAISI and other evaluators have found, the risk that AI agents can use their tools to cheat.

As part of our mission, CAISI both directly evaluates AI models and seeks to advance best practices for AI measurement science. To improve our evaluations, we built an AI transcript analysis tool to search through our historical evaluation transcripts for cheating. To support the development of stronger ecosystem practices, the following pages share examples that we uncovered and suggests takeaways that may help other evaluators reduce the incidence and impact of evaluation cheating.

Using our transcript analysis tool, we found several examples of how models were able to successfully cheat on agentic coding and cyber benchmarks, including:

  • Models using the internet to find walkthroughs and answers for cyber capture-the-flag challenges
  • Models using generic denial-of-service attacks to crash servers on cyber tasks instead of exploiting intended vulnerabilities
  • Models cheating on coding benchmarks by looking up more recent code versions, disabling assertions, and adding test-specific logic.

We bucket these examples into two categories of cheating risks: solution contamination, where a model accesses information that improperly reveals the solution to an evaluation task; and grader gaming, where a model exploits a gap or misspecification in an evaluation’s automated scoring system to craft a solution that scores highly without fulfilling the “spirit” of the intended task.

Examples of evaluation cheating from CAISI’s evaluation logs
CAISI Benchmark Cheating Examples from Evaluation Logs Cheating Type % Logs with Successful Solution Due to Cheating (Lower Bound)
Cybench
  • Using coding tools to search the internet for challenge flags and walkthroughs
Solution contamination 0.3%
SWE-bench Verified
  • Reviewing more recent code versions on GitHub
  • Installing more recent code versions using package managers
Solution contamination 0.1%
  • Commenting out assertion checks to pass unit tests
Grader gaming 0.2%
CVE-Bench (Internal)
  • Using denial-of-service attacks to crash the target server instead of exploiting the CVE
Grader gaming  4.80%

In general, we define evaluation cheating as:

when an AI model exploits a gap between what an evaluation task is intended to measure and its implementation, solving the task in a way that subverts the validity of the measurement.

This definition focuses on the problem that cheating creates for evaluation validity: if models can exploit implementation loopholes to score higher without actually improving at the skills an evaluation is intended to measure, it degrades the value of that measurement for decisions about real-world adoption. As models become increasingly adept problem-solvers, they may be able to find new successful cheating strategies, and detecting and preventing cheating may become increasingly important for the validity and comparability of evaluation results.

Based on lessons we learned through this process, we share some preliminary suggested practices for other evaluators and benchmark designers interested in addressing evaluation cheating, including:

  • Review evaluation transcripts for cheating, including by leveraging AI transcript analysis tools that can help scale human review processes.
  • Prevent cheating by closing task design loopholes and setting clear rules in task prompts in order to make model comparisons more accurate and fair.
  • Standardize benchmark-specific expectations about agent affordances and restrictions to help evaluators create more comparable evaluation results, including by making it easier to catch and prevent cheating.

View the full writeup to see examples of cheating from CAISI’s agent evaluations and a discussion of practices to address evaluation cheating.

Share. Facebook Twitter Pinterest LinkedIn Tumblr Telegram Email
Previous ArticleAnalyzing Transcripts from AI Agent Evaluations
Next Article Reflections from the Second NIST Cyber AI Profile Workshop

Related Posts

All aboard: the NIST Cybersecurity for IoT Program is headed to our next stop! Share your input on where we’re headed during our Future Directions Two-Day Workshop on March 31st.

16 April 2026

Accelerating AI Innovation Through Measurement Science

15 April 2026
Cheating On AI Agent Evaluations

Reflections from the Second NIST Cyber AI Profile Workshop

15 April 2026
Top Posts
Maximize Flow for Your Organization’s Long-Term Success

Maximize Flow for Your Organization’s Long-Term Success

16 April 2026

All aboard: the NIST Cybersecurity for IoT Program is headed to our next stop! Share your input on where we’re headed during our Future Directions Two-Day Workshop on March 31st.

16 April 2026
Manufacturing in America – Contributing to Our Economy, Employment, and Innovation

Manufacturing in America – Contributing to Our Economy, Employment, and Innovation

16 April 2026

Subscribe to Updates

Get the latest Manufacturing news and updates directly to your inbox.

© 2026 Manufacturing. All Rights Reserved.
  • Privacy Policy
  • Terms of use
  • Advertise
  • Contact

Type above and press Enter to search. Press Esc to cancel.