Merrill Lutsky

Evaluating AI models for code comprehension

Graphite tested the latest AI models for code review and found a surprising result: newer, bigger models actually created more noise for developers.

#1 · about 2 minutes

The challenge of reviewing an exponentially growing volume of AI-generated code

The rapid increase in AI-generated code creates a significant bottleneck in the software development lifecycle, particularly in the code review stage.

#2 · about 3 minutes

How AI code generation strains the developer outer loop

While AI accelerates code writing (the inner loop), it overwhelms the critical outer loop processes of testing, reviewing, and deploying code.

#3 · about 1 minute

Introducing Diamond, an AI agent for automated code review

Graphite's AI agent, Diamond, acts as an always-on senior engineer within GitHub to summarize, prioritize, and review every code change.

#4 · about 3 minutes

Using comment acceptance rate to measure AI review quality

The primary metric for a successful AI reviewer is the acceptance rate of its comments, as every high-signal comment should result in a code change.
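
To make the metric concrete, here is a minimal sketch of how an acceptance rate could be computed. The `ReviewComment` shape and its `led_to_change` field are assumptions for illustration, not Graphite's actual schema; the closer the rate is to 1.0, the more of the reviewer's output authors actually act on.

```python
from dataclasses import dataclass

@dataclass
class ReviewComment:
    """One AI-generated review comment (hypothetical schema)."""
    pr_id: int
    led_to_change: bool  # did the author push a change addressing it?

def acceptance_rate(comments: list[ReviewComment]) -> float:
    """Fraction of AI comments that resulted in a code change."""
    if not comments:
        return 0.0
    return sum(c.led_to_change for c in comments) / len(comments)

# Toy example: 7 of 10 comments were acted on -> 70% acceptance
sample = [ReviewComment(pr_id=1, led_to_change=(i < 7)) for i in range(10)]
print(f"{acceptance_rate(sample):.0%}")  # 70%
```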

#5 · about 1 minute

Why evaluations are the key lever for LLM performance

Unlike traditional machine learning, optimizing large language models relies heavily on a robust evaluation process rather than other levers like feature engineering.

#6 · about 2 minutes

A methodology for evaluating AI code comprehension models

Models are evaluated against a large dataset of pull requests using two core metrics: matched comment rate for recall and unmatched comment rate for noise.
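
As a sketch of how these two metrics relate, the functions below score a model's comments against a pull request's labeled issues. Real matching would need to be semantic, since a model's comments rarely match issue labels verbatim; exact set membership here is a simplifying assumption.

```python
def matched_comment_rate(labeled_issues: set[str], model_comments: set[str]) -> float:
    """Recall proxy: share of known issues the model flagged."""
    if not labeled_issues:
        return 0.0
    return len(labeled_issues & model_comments) / len(labeled_issues)

def unmatched_comment_rate(labeled_issues: set[str], model_comments: set[str]) -> float:
    """Noise proxy: share of model comments matching no known issue."""
    if not model_comments:
        return 0.0
    return len(model_comments - labeled_issues) / len(model_comments)

# Toy PR: 3 labeled issues, 4 model comments
issues = {"off-by-one", "missing-null-check", "sql-injection"}
comments = {"off-by-one", "missing-null-check", "style-nit", "rename-var"}
print(matched_comment_rate(issues, comments))    # ~0.67 (2 of 3 issues found)
print(unmatched_comment_rate(issues, comments))  # 0.5   (2 of 4 comments are noise)
```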

#7 · about 3 minutes

A comparative analysis of GPT-4o, Opus, and Gemini

A detailed comparison reveals that models like GPT-4o excel at precision while Gemini has the best recall, but no single model wins on all metrics.

#8 · about 2 minutes

Evaluating Sonnet models and the problem of AI noise

Testing reveals that Sonnet 4 generates the most noise, making it less suitable for high-signal code review than its predecessors.

#9 · about 2 minutes

Why Sonnet 3.7 offers the best balance for code review

Sonnet 3.7 is the chosen model because it offers the best blend of strong reasoning, high recall of important issues, and a low rate of noisy comments.

#10 · about 3 minutes

The critical role of continuous evaluation for new models

The key to leveraging AI effectively is to constantly re-evaluate new models, as shown by preliminary tests on Grok 4, which revealed significant performance gaps.
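
Continuous re-evaluation can be pictured as re-running the identical benchmark for every candidate model and comparing the resulting metrics side by side. In this sketch, `generate_comments` is a hypothetical stand-in for the actual model call, and the dataset format is assumed.

```python
# Illustrative re-evaluation harness. `generate_comments` is a hypothetical
# stand-in for a real code-review model call; the dataset format is assumed.
def generate_comments(model: str, diff: str) -> set[str]:
    """Canned output so the sketch runs end to end."""
    return {"off-by-one", "style-nit"}

def evaluate(model: str, dataset: list[dict]) -> dict[str, float]:
    """Score one model on the whole benchmark with the two core metrics."""
    matched = unmatched = total_issues = total_comments = 0
    for pr in dataset:
        issues = set(pr["labeled_issues"])
        comments = generate_comments(model, pr["diff"])
        matched += len(issues & comments)    # known issues the model found
        unmatched += len(comments - issues)  # comments matching no known issue
        total_issues += len(issues)
        total_comments += len(comments)
    return {
        "matched_rate": matched / total_issues if total_issues else 0.0,
        "unmatched_rate": unmatched / total_comments if total_comments else 0.0,
    }

# Re-run the same benchmark whenever a new model ships, then compare.
benchmark = [{"diff": "...", "labeled_issues": ["off-by-one", "missing-null-check"]}]
for candidate in ["candidate-a", "candidate-b"]:
    print(candidate, evaluate(candidate, benchmark))
```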
