Key takeaways:
- AI models frequently pass automated SWE-bench tests, yet human maintainers reject that same code for failing to meet professional standards.
- While AI is getting better at avoiding syntax errors, it still struggles with “soft” requirements like proper coding style, repository standards, and maintaining complex project logic.
- Relying solely on automated scores leads the industry to significantly overestimate the readiness of models like GPT-5 and Claude for real-world production environments without human oversight.
A new METR study highlights how automated code review for AI-generated code might not be ready for prime time.
METR's evaluations of AI coding tools and large language models (LLMs) have become the industry's yardstick.
Now, a new study reveals that passing automated unit tests doesn't necessarily mean AI-generated code would meet industry standards.
Depending on the model, between half and two-thirds of AI-generated pull requests that successfully pass the popular SWE-bench automated grader (which checks whether a patch passes a repository's existing unit tests) would be rejected by human repository maintainers, the new research shows.
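The gap the study measures can be illustrated with a toy model (all names here are hypothetical, not METR's actual harness): the automated grader sees only one signal, while a human maintainer layers "soft" requirements on top of it.

```python
from dataclasses import dataclass

@dataclass
class Patch:
    passes_existing_tests: bool       # the only signal a SWE-bench-style grader sees
    follows_repo_style: bool = True   # invisible to the automated grader
    touches_unrelated_code: bool = False

def automated_grade(patch: Patch) -> bool:
    # SWE-bench-style check: did the repository's existing unit tests pass?
    return patch.passes_existing_tests

def human_review(patch: Patch) -> bool:
    # Maintainers also weigh style, repo conventions, and collateral damage.
    return (patch.passes_existing_tests
            and patch.follows_repo_style
            and not patch.touches_unrelated_code)

sloppy = Patch(passes_existing_tests=True, follows_repo_style=False)
print(automated_grade(sloppy))  # True  - the benchmark counts it as a success
print(human_review(sloppy))     # False - a maintainer would reject it
```

A patch can therefore score as a clean pass on the benchmark while still failing review, which is exactly the drop-off the study quantifies.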
To establish a baseline, the researchers slipped known-good, human-written patches into the blind review process. While 100% of these verified human patches passed the automated tests, strict human reviewers approved just 68% of them. With AI agents, the drop-off was much more severe.
The study enlisted active maintainers from major open-source projects, including scikit-learn, Sphinx, and pytest, to blindly review 296 AI-generated patches and decide whether or not to merge the code into production. Across the various maintainers surveyed, the average merge rate was 24.2 percentage points lower than the automated benchmark scores suggested.
This suggests that automated benchmarks significantly inflated the perceived ability of models released between mid-2024 and late 2025, including Anthropic's Claude series and GPT-5, to deliver safe, error-free code.
Chiming with lived experience
"Their latest study actually makes a lot of sense for me because this matches my experience with LLMs," says Andrei Maxim, a Ruby developer based in Romania. Maxim has used Claude Code since May 2025 on a variety of different projects.
While he had seen the coding tools become less likely to make silly errors, such as putting a block in the wrong place and creating syntax errors, he hadn't seen the doubling of code quality that the METR results would imply happened in the leap from Claude Opus 4.5 to 4.6.
When maintainers did reject the AI-generated code, they frequently cited poor code quality, such as bad style or failing to meet repository standards, alongside breakages in unrelated code and core functionality bugs. While, as Maxim points out, newer models like Claude 4.5 Sonnet have largely moved past core functionality failures, these findings suggest they still struggle significantly to produce code that meets human quality standards.
The findings offer a stark warning: relying on a naive interpretation of automated benchmark scores could lead the industry to severely overestimate how useful AI coding agents are in real-world workflows.
Ready for prime time?
"The main thing I take from this is that humans, the project maintainers, will always take a more subjective view of PRs than a machine," says Simon Ritter, deputy chief technical officer at Azul, a cloud platform for Java.
Looking at the results, Ritter points out that maintainers rarely rejected code for breaking other users' code. "This would indicate to me that, when the AI agent produces a patch that passes the automatic grader, it does so in a way that works for the project," he explains. "Where there are rejections, it would indicate that there may not be sufficient tests," Ritter adds.
A human project maintainer will be able to spot something that breaks other code in situations like this because of their in-depth knowledge of the whole project codebase.
Mountains from molehills?
Ritter also suggests that even the failures attributed to code quality are in some ways a matter of personal opinion. "This again is most likely a result of the maintainer's subjectivity," he says. "What is good or bad code style comes down to the maintainer's opinion."
Ritter remains bullish on AI-supported coding: "Ultimately, AI coding agents provide a valuable tool to improve the efficiency of software developers. However, we're still some way from eliminating the need for them completely, if ever." Still, the findings suggest that something is missing between what AI coding tools produce and what human maintainers want.

That might require METR to re-evaluate its methods, reckons Maxim. "There's one very specific thing I'd wish the people at METR would do," he says. "They currently test which tasks an LLM can finish successfully 50% of the time, but in reality that's not what happens. What I normally do is ask the LLM to do a task and it runs, then I verify the result."
Any evaluation of AI's code-creation abilities should be based on getting it right the first time, rather than half of the time.