
AI-generated code passes far more automated tests than human review

Humans are notoriously hard to please.
March 13, 2026



Key takeaways:

  • AI models frequently pass automated SWE-bench tests, yet human maintainers reject that same code for failing to meet professional standards.
  • While AI is getting better at avoiding syntax errors, it still struggles with “soft” requirements like proper coding style, repository standards, and maintaining complex project logic.
  • Relying solely on automated scores leads the industry to significantly overestimate the readiness of models like GPT-5 and Claude for real-world production environments without human oversight.

A new METR study highlights how automated code review for AI-generated code might not be ready for prime time.

METR’s evaluations of AI coding tools and large language models (LLMs) have become the industry’s yardstick.

Now, a new study reveals that passing automated unit tests doesn’t necessarily mean AI-generated code would meet industry standards.

Depending on the model, between half and two-thirds of AI-generated pull requests that successfully pass the popular SWE-bench automated grader (which checks whether a patch passes a repository’s existing unit tests) would be rejected by human repository maintainers, the new research shows.
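The gap between the two bars can be sketched in a few lines. This is a hypothetical illustration, not METR’s or SWE-bench’s actual harness; the function names and toy “patch” below are invented for the example:

```python
# Hypothetical sketch of the two evaluation bars described above.
# Not METR's or SWE-bench's real harness; all names here are invented.

def automated_grade(patch, unit_tests):
    """SWE-bench-style check: pass iff every existing unit test passes."""
    return all(test(patch) for test in unit_tests)

def human_review(patch, unit_tests, style_ok, meets_repo_standards):
    """A maintainer demands passing tests AND softer quality criteria."""
    return (automated_grade(patch, unit_tests)
            and style_ok
            and meets_repo_standards)

# A toy "patch" modeled as a function under test.
patch = lambda x: x + 1
unit_tests = [lambda p: p(1) == 2, lambda p: p(0) == 1]

print(automated_grade(patch, unit_tests))        # True: the benchmark counts this as solved
print(human_review(patch, unit_tests,
                   style_ok=False,               # e.g. violates the repo's style
                   meets_repo_standards=True))   # False: the maintainer still rejects it
```

The same patch “passes” or “fails” depending on which bar is applied, which is the study’s core observation: the automated grader only sees the unit tests, while maintainers apply extra, partly subjective criteria on top.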

To establish a baseline, the researchers slipped known-good, human-written patches into the blind review process. While 100% of these verified human patches passed the automated tests, strict human reviewers approved just 68% of them. With AI agents, the drop-off was far more severe.

The study enlisted active maintainers from major open-source projects, including scikit-learn, Sphinx, and pytest, to blindly review 296 AI-generated patches and decide whether to merge the code into production. Across the maintainers surveyed, the average merge rate was 24.2 percentage points lower than the automated benchmark scores suggested.

This suggests automated benchmarks significantly inflate the perceived ability of the models tested (released between mid-2024 and late 2025, including Anthropic’s Claude series and GPT-5) to deliver safe, error-free code.

Chiming with lived experience

“Their latest study actually makes a lot of sense for me because this matches my experience with LLMs,” says Andrei Maxim, a Ruby developer based in Romania. Maxim has used Claude Code since May 2025 on a variety of different projects.

While he had seen the coding tools become less likely to make silly errors, such as putting a block in the wrong place and creating syntax errors, he hadn’t seen the doubling of code quality that the METR results would suggest happened in the leap from Claude Opus 4.5 to 4.6.

When maintainers did reject the AI-generated code, they frequently cited poor code quality, such as bad style or failing to meet repository standards, alongside breakages in unrelated code and core functionality bugs. While, as Maxim points out, newer models like Claude 4.5 Sonnet have largely moved past core functionality failures, these findings suggest they still struggle significantly to produce code that meets human quality standards.

The findings offer a stark warning: relying on a naive interpretation of automated benchmark scores could lead the industry to severely overestimate how useful AI coding agents are in real-world workflows.

Ready for prime time?

“The main thing I take from this is that humans – the project maintainers – will always take a more subjective view of PRs than a machine,” says Simon Ritter, deputy chief technical officer at Azul, a cloud platform for Java.

Looking at the results, Ritter points out that maintainers rarely reject a patch for breaking other users’ code. “This would indicate to me that, when the AI agent produces a patch that passes the automatic grader, it does so in a way that works for the project,” he explains. “Where there are rejections, it would indicate that there may not be sufficient tests,” Ritter adds.

A human project maintainer will be able to spot something that breaks other code in situations like this because of their in-depth knowledge of the whole project codebase.

Mountains from molehills?

Ritter also suggests that even the failures attributed to code quality are, to some degree, matters of personal opinion. “This again is most likely a result of the maintainer’s subjectivity,” he says. “What is good or bad code style comes down to the maintainer’s opinion.”

Ritter remains bullish on the idea of AI-supported coding: “Ultimately, AI coding agents provide a valuable tool to improve the efficiency of software developers. However, we’re still some way from eliminating the need for them completely, if ever.” Still, the findings suggest that something is missing between what AI coding tools produce and what human maintainers want.


That might require METR to re-evaluate its methods, reckons Maxim. “There’s one very specific thing I’d wish the people at METR would do,” he says. “They currently test which tasks an LLM can finish successfully 50% of the time, but in reality that’s not what happens. What I normally do is ask the LLM to do a task and it runs, then I verify the result.”

Any evaluation of AI’s code-generation abilities, he argues, should be based on getting it right the first time, rather than half of the time.