Key takeaways:
- AI models frequently pass automated SWE-bench tests, yet human maintainers reject that same code for failing to meet professional standards.
- While AI is getting better at avoiding syntax errors, it still struggles with “soft” requirements like proper coding style, repository standards, and maintaining complex project logic.
- Relying solely on automated scores leads the industry to significantly overestimate the readiness of models like GPT-5 and Claude for real-world production environments without human oversight.
A new METR study highlights how automated code review for AI-generated code might not be ready for prime time.
METR's evaluations of AI coding tools and large language models (LLMs) have become the industry's yardstick.
Now, a new study reveals that passing automated unit tests doesn't necessarily mean AI-generated code would meet industry standards.
Depending on the model, between half and two-thirds of AI-generated pull requests that successfully pass the popular SWE-bench automated grader (which checks whether a patch passes a repository's existing unit tests) would be rejected by human repository maintainers, the new research shows.
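The gap the study measures can be illustrated with a toy model (all names here are hypothetical, not METR's actual harness): the automated grader sees only one signal, while a human maintainer layers "soft" requirements on top of it.

```python
from dataclasses import dataclass

@dataclass
class Patch:
    passes_existing_tests: bool       # the only signal a SWE-bench-style grader sees
    follows_repo_style: bool = True   # invisible to the automated grader
    touches_unrelated_code: bool = False

def automated_grade(patch: Patch) -> bool:
    # SWE-bench-style check: did the repository's existing unit tests pass?
    return patch.passes_existing_tests

def human_review(patch: Patch) -> bool:
    # Maintainers also weigh style, repo conventions, and collateral damage.
    return (patch.passes_existing_tests
            and patch.follows_repo_style
            and not patch.touches_unrelated_code)

sloppy = Patch(passes_existing_tests=True, follows_repo_style=False)
print(automated_grade(sloppy))  # True  - the benchmark counts it as a success
print(human_review(sloppy))     # False - a maintainer would reject it
```

A patch can therefore score as a clean pass on the benchmark while still failing review, which is exactly the drop-off the study quantifies.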
To establish a baseline, the researchers slipped known-good, human-written patches into the blind review process. While 100% of these verified human patches passed the automated tests, strict human reviewers approved just 68% of them. With AI agents, the drop-off was much more severe.
The study enlisted active maintainers from major open-source projects, including scikit-learn, Sphinx, and pytest, to blindly review 296 AI-generated patches and decide whether or not to merge the code into production. Across the various maintainers surveyed, the average merge rate was 24.2 percentage points lower than the automated benchmark scores suggested.
This suggests that automated benchmarks significantly inflated the perceived ability of models released between mid-2024 and late 2025, including Anthropic's Claude series and GPT-5, to deliver safe, error-free code.
Chiming with lived experience
"Their latest study actually makes a lot of sense for me because this matches my experience with LLMs," says Andrei Maxim, a Ruby developer based in Romania. Maxim has used Claude Code since May 2025 on a variety of different projects.
While he had seen the coding tools become less likely to make silly errors, such as putting a block in the wrong place and creating syntax errors, he hadn't seen the doubling of code quality that the METR results would imply happened in the leap from Claude Opus 4.5 to 4.6.
When maintainers did reject the AI-generated code, they frequently cited poor code quality, such as bad style or failing to meet repository standards, alongside breakages in unrelated code and core functionality bugs. While, as Maxim points out, newer models like Claude 4.5 Sonnet have largely moved past core functionality failures, these findings suggest they still struggle significantly to produce code that meets human quality standards.
The findings offer a stark warning: relying on a naive interpretation of automated benchmark scores could lead the industry to severely overestimate how useful AI coding agents are in real-world workflows.
Ready for prime time?
"The main thing I take from this is that humans, the project maintainers, will always take a more subjective view of PRs than a machine," says Simon Ritter, deputy chief technical officer at Azul, a cloud platform for Java.
Looking at the results, Ritter points out that maintainers rarely rejected code for breaking other users' code. "This would indicate to me that, when the AI agent produces a patch that passes the automatic grader, it does so in a way that works for the project," he explains. "Where there are rejections, it would indicate that there may not be sufficient tests," Ritter adds.
A human project maintainer will be able to spot something that breaks other code in situations like this because of their in-depth knowledge of the whole project codebase.
Mountains from molehills?
Ritter also suggests that even the failures attributed to code quality are in some ways a matter of personal opinion. "This again is most likely a result of the maintainer's subjectivity," he says. "What is good or bad code style comes down to the maintainer's opinion."
Ritter remains bullish on AI-supported coding: "Ultimately, AI coding agents provide a valuable tool to improve the efficiency of software developers. However, we're still some way from eliminating the need for them completely, if ever." Still, the findings suggest that something is missing between what AI coding tools produce and what human maintainers want.

That might require METR to re-evaluate its methods, reckons Maxim. "There's one very specific thing I'd wish the people at METR would do," he says. "They currently test which tasks an LLM can finish successfully 50% of the time, but in reality that's not what happens. What I normally do is ask the LLM to do a task and it runs, then I verify the result."
Any evaluation of AI's code-creation abilities should be based on getting it right the first time, rather than half of the time.