Analyzing Transcripts from AI Agent Evaluations

In December, CAISI published a write-up on how AI models can cheat on agentic evaluations, including lessons from our experience building and using AI-enabled transcript analysis tools to find and fix examples of cheating from our evaluations.

In that post, we highlighted the potential of AI-enabled transcript analysis tools to help evaluators scale their capacity to detect measurement issues in evaluations — particularly as they evaluate agentic AI systems that can work on tasks for longer periods of time. We emphasized the need for continued collaboration on shared practices and tooling to help the evaluation community adopt, scale and improve transcript review practices.

Recently, we contributed several of the practices and takeaways we identified in our research to a new joint research paper with the UK AI Security Institute and other AI evaluators. The paper outlines a multi-step process for building and using transcript review tools, from preparing log data to designing and validating a scanner in an iterative loop. At each step, it provides concrete examples and implementation considerations, based on experiences and takeaways aggregated from evaluators’ different transcript analysis projects and use cases.

The paper also includes implementation case studies using a new open-source transcript analysis framework, Inspect Scout, built by the UK AISI working closely with Meridian Labs. We’ve been able to collaborate with the developers to inform the design of features based on our own use cases, and are excited to see the development of more technical frameworks and tools that can help enable the wider adoption of transcript analysis by the AI evaluation community.

We’re excited to share these collaboratively developed examples and practices to aid other evaluators, and to continue our work to contribute to frameworks, tools, and practices that can help advance more rigorous, valid, and impactful AI measurement science.

What's On

NIST Awards $15 Million to ASTM International to Establish Standardization Center of Excellence

Bionny Company 2026: Wie ein 159€-Wearable ohne Abo den 60-Milliarden-Markt herausfordert — und was das für Unternehmen bedeutet

Manufacturing Outlook 2026: 12 Trends Reshaping Factories (AI, Robotics, Reshoring, Energy)

Why Poor Knowledge Management Is Costing Manufacturing Companies Millions (And How to Fix It)

Best Web Hosting Servers Compared 2026: Find Your Perfect Provider (Hostinger Ranked #1)

Analyzing Transcripts from AI Agent Evaluations

Reflections from the First Cyber AI Profile Workshop

Let’s get Digital! Updated Digital Identity Guidelines are Here!

Sharpening the Focus on Product Requirements and Cybersecurity Risks: Updating Foundational Activities for IoT Product Manufacturers

Bionny Company 2026: Wie ein 159€-Wearable ohne Abo den 60-Milliarden-Markt herausfordert — und was das für Unternehmen bedeutet

Manufacturing Outlook 2026: 12 Trends Reshaping Factories (AI, Robotics, Reshoring, Energy)

Why Poor Knowledge Management Is Costing Manufacturing Companies Millions (And How to Fix It)

What's On

Analyzing Transcripts from AI Agent Evaluations

Related Posts