Merrill Lutsky
Evaluating AI models for code comprehension
#1 · about 2 minutes
The challenge of reviewing exponentially growing AI-generated code
The rapid increase in AI-generated code creates a significant bottleneck in the software development lifecycle, particularly in the code review stage.
#2 · about 3 minutes
How AI code generation strains the developer outer loop
While AI accelerates code writing (the inner loop), it overwhelms the critical outer loop processes of testing, reviewing, and deploying code.
#3 · about 1 minute
Introducing Diamond, an AI agent for automated code review
Graphite's AI agent, Diamond, acts as an always-on senior engineer within GitHub to summarize, prioritize, and review every code change.
#4 · about 3 minutes
Using comment acceptance rate to measure AI review quality
The primary metric for a successful AI reviewer is the acceptance rate of its comments, as every high-signal comment should result in a code change.
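To make the metric concrete, here is a minimal sketch of how a comment acceptance rate could be computed. The `ReviewComment` structure and its `addressed` flag are illustrative assumptions, not Graphite's implementation:

```python
from dataclasses import dataclass

@dataclass
class ReviewComment:
    pr_number: int
    body: str
    addressed: bool  # True if the author changed the code in response

def acceptance_rate(comments: list[ReviewComment]) -> float:
    """Fraction of AI review comments that led to a code change."""
    if not comments:
        return 0.0
    return sum(c.addressed for c in comments) / len(comments)

comments = [
    ReviewComment(101, "Possible nil dereference in request handler", True),
    ReviewComment(101, "Consider renaming this variable", False),
    ReviewComment(102, "Off-by-one in pagination loop", True),
]
print(f"Acceptance rate: {acceptance_rate(comments):.0%}")  # 67%
```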
#5 · about 1 minute
Why evaluations are the key lever for LLM performance
Unlike traditional machine learning, where practitioners can pull levers such as feature engineering, optimizing large language models depends chiefly on a robust evaluation process.
#6 · about 2 minutes
A methodology for evaluating AI code comprehension models
Models are evaluated against a large dataset of pull requests using two core metrics: matched comment rate for recall and unmatched comment rate for noise.
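A minimal sketch of how these two metrics could be computed over a labeled dataset follows. The `naive_match` matcher is a deliberate simplification; a production eval would more plausibly use an LLM judge or embedding similarity to decide whether a generated comment matches a ground-truth issue:

```python
from typing import Callable

# (generated comment, ground-truth issue) -> does the comment cover the issue?
Matcher = Callable[[str, str], bool]

def matched_comment_rate(issues: list[str], comments: list[str],
                         matches: Matcher) -> float:
    """Recall: fraction of known issues covered by at least one comment."""
    if not issues:
        return 0.0
    covered = sum(1 for issue in issues
                  if any(matches(c, issue) for c in comments))
    return covered / len(issues)

def unmatched_comment_rate(issues: list[str], comments: list[str],
                           matches: Matcher) -> float:
    """Noise: fraction of generated comments matching no known issue."""
    if not comments:
        return 0.0
    noisy = sum(1 for c in comments
                if not any(matches(c, issue) for issue in issues))
    return noisy / len(comments)

def naive_match(comment: str, issue: str) -> bool:
    # Deliberately simplistic; real evals would use an LLM judge or embeddings.
    return issue.lower() in comment.lower()

issues = ["sql injection in search query", "missing error handling"]
comments = [
    "This endpoint has a SQL injection in search query construction.",
    "Nit: prefer f-strings here.",
]
print(matched_comment_rate(issues, comments, naive_match))    # 0.5 (recall)
print(unmatched_comment_rate(issues, comments, naive_match))  # 0.5 (noise)
```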
#7 · about 3 minutes
A comparative analysis of GPT-4o, Opus, and Gemini
A detailed comparison reveals that models like GPT-4o excel at precision while Gemini delivers the best recall, but no single model wins on every metric.
#8 · about 2 minutes
Evaluating Sonnet models and the problem of AI noise
Testing reveals that Sonnet 4 generates the most noise, making it less suitable for high-signal code review than its predecessors.
#9 · about 2 minutes
Why Sonnet 3.7 offers the best balance for code review
Sonnet 3.7 is the chosen model because it provides the best balance of strong reasoning, high recall on important issues, and few noisy comments.
#10 · about 3 minutes
The critical role of continuous evaluation for new models
The key to leveraging AI effectively is to continually re-evaluate new models, as shown by preliminary tests on Grok 4 that revealed significant performance gaps.
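Tying the pieces together, a continuous-evaluation harness might rerun the same suite for every new candidate model. This sketch reuses `matched_comment_rate`, `unmatched_comment_rate`, and `naive_match` from the previous sketch; `review_pr` and the model names are stand-ins, not real APIs:

```python
def review_pr(model: str, diff: str) -> list[str]:
    # Placeholder: a real harness would send the diff to the model's API.
    return ["possible sql injection in search query"]

def evaluate_model(model: str, dataset: list[dict]) -> dict:
    recall = noise = 0.0
    for pr in dataset:
        comments = review_pr(model, pr["diff"])
        recall += matched_comment_rate(pr["issues"], comments, naive_match)
        noise += unmatched_comment_rate(pr["issues"], comments, naive_match)
    n = len(dataset)
    return {"model": model, "recall": recall / n, "noise": noise / n}

labeled_prs = [{"diff": "...", "issues": ["sql injection in search query"]}]
for model in ["model-a", "model-b"]:  # rerun whenever a new model ships
    print(evaluate_model(model, labeled_prs))
```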