
DORA metrics are lying to you and AI is making it worse

DORA metrics were fine. Then AI showed up.
May 06, 2026


Key takeaways:

  • DORA metrics can improve while your systems become less understood. Faster delivery creates false confidence when AI is generating code nobody fully comprehends.
  • The blind spot isn’t delivery, it’s legibility: DORA measures work flowing through the pipe, not whether anyone can explain what’s in it.
  • Stop making DORA carry work it wasn’t built for!

For over a decade, DevOps Research and Assessment (DORA) metrics have been the closest thing software engineering has to a universal language for measuring delivery performance.

Deployment frequency, lead time for changes, change failure rate, and mean time to recovery – four numbers that could tell you, with reasonable confidence, whether a team was shipping well or struggling. Then AI arrived, and the numbers stopped making sense.
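
To make those four definitions concrete, here is a minimal Python sketch that computes them from a handful of illustrative deployment and incident records. The data shape, field order, and seven-day window are assumptions made for the example, not a standard schema.

    from datetime import datetime, timedelta

    # Illustrative records: (deployed_at, first_commit_at, caused_failure)
    deployments = [
        (datetime(2026, 5, 1, 10), datetime(2026, 4, 30, 15), False),
        (datetime(2026, 5, 2, 9), datetime(2026, 5, 1, 11), True),
        (datetime(2026, 5, 4, 14), datetime(2026, 5, 3, 16), False),
    ]

    # Illustrative incidents: (started_at, recovered_at)
    incidents = [(datetime(2026, 5, 2, 9, 30), datetime(2026, 5, 2, 10, 15))]

    WINDOW_DAYS = 7

    # Deployment frequency: deploys per day over the window.
    deployment_frequency = len(deployments) / WINDOW_DAYS

    # Lead time for changes: average commit-to-deploy interval.
    lead_times = [deployed - committed for deployed, committed, _ in deployments]
    lead_time = sum(lead_times, timedelta()) / len(lead_times)

    # Change failure rate: share of deploys that caused a failure.
    change_failure_rate = sum(1 for *_, failed in deployments if failed) / len(deployments)

    # Mean time to recovery: average incident duration.
    durations = [recovered - started for started, recovered in incidents]
    mttr = sum(durations, timedelta()) / len(durations)

    print(deployment_frequency, lead_time, change_failure_rate, mttr)

Notice what never appears in this calculation: any signal about whether the people shipping those changes understand them.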

In retrospect, I saw this problem arise before it ever reached a critical system. A junior engineer leaned heavily on AI. Instead of reaching out to senior team members for mentorship, they would go off on their own, prompt an AI assistant, and copy that output into tickets, moving forward with work that was often wrong, shallow, or incomplete.

They did not have enough depth to evaluate what the system was giving them, and they were too afraid of looking inexperienced to slow down and ask better questions. The work appeared active and responsive, so the problem went largely unaddressed until a critical project was underway.

Then the gap showed itself all at once. There had been visible throughput. There had not been real comprehension. Another side effect was the burden on the senior team members who were reviewing the merge requests (MRs) and pushing back; they were effectively doing prompt refinement by proxy.

DORA metrics in the AI era

At the organizational scale, DORA metrics conceptually show the same failure mode. They effectively reflect the flow of work through a visible delivery pipe. They don’t capture whether teams possess a working comprehension of the systems impacted by those changes.

DORA metrics alone can produce a false sense of system health and engineering excellence, measuring delivery efficiency without accounting for system understanding. AI didn’t create a new way to fail. It shone a light on a failure mode that has always existed.

This legibility gap matters most around your critical systems. These are the systems your business and customers depend on. They are the ones that land you in front of regulators. Illegibility in your critical systems poses a risk of harming clients and your business.

Risk is now compounding because change is happening outside the control path that DORA measures: on vendor platforms, in identity providers (IdPs), and through AI agents with delegated authority. The list keeps growing every day.

Change is happening outside of the traditional software development life cycle (SDLC). Some teams are using agents for code reviews. Do we know anyone who isn’t using one for coding? Some do it for both and just focus on better prompting.

The result is not confusion. It is false confidence. DORA metrics can improve while understanding diminishes and operational risk expands. Your org looks healthy, but your systems are illegible. You can roll back, but you can’t diagnose problems.

Applications, services, and systems become harder to explain. Every code and config change, including a client config, now carries increased complexity and risk. Your dashboards show faster delivery and an acceptable change failure rate, while the system becomes more dependent on hidden changes, fragile ownership chains, and concentrated knowledge.

As leaders, we need to start with some basic questions. Who is defining what good DORA metrics look like? Who bears the cost when the metrics improve for the wrong reasons? What blast radius do these metrics actually cover?

These questions put the dashboards into perspective and will support an org committed to excellence over theater.

What are DORA’s blind spots?

DORA grants visibility into change when ownership is traceable and the surface area for change is known. When a failure’s blast radius extends past the understood consequence boundary, you can no longer rely on the metrics alone for governance.

Critical systems today may depend on third-party IdPs, multiple data flows, a workflow tool, and an AI-driven support or triage layer that sits between the team and real customer impact. In these scenarios, changes are happening in code, vendor consoles, identity rules, and more. Some happen through AI-agent glue with delegated authority that no one really wants to admit is there, let alone load-bearing.

DORA metrics are incomplete in these situations. That incompleteness isn’t the problem on its own. Pretending they’re giving you the whole story is.

Deployment frequency can rise while the change failure rate stays flat, even as a vendor rule change quietly alters who clears the authentication layer. By the time the issue appears in the app tier, the measured service is the only place where the consequence becomes visible.

Mean time to recovery (MTTR) also looks excellent when detection is fast and rollback is faster. I have seen incidents where customers were back up within minutes after a rollback. Nobody on the call could explain why. We found a dark service deployment that disrupted a live critical path.

The dark service’s team insisted they couldn’t possibly have caused the issue, right up until the rollback brought the critical path back up instantly. I’m not sure they ever satisfactorily explained why. The illegible gap between the dark deploy and the outage is where governance actually lives.

If deployment frequency rises and invisible dependencies shape critical behavior, the numbers are describing only what you can see. They are not false; they are, however, not the entire picture. I have run enough code reviews, retrospectives, and root cause analyses (RCAs) where the numbers were green and my operators were nervous.

I have found it has always been true that dashboards can be healthy while systems are unmanageable. AI has increased that problem by orders of magnitude.

Signals your metrics are overstating system health

Incidents take longer to diagnose, even when metrics are green. A team may still be shipping quickly, but when something breaks, it takes longer to figure out where the behavior changed, who owns the affected dependency, or which change actually mattered.

Meanwhile, important changes happen outside the systems you review. When system behavior changes due to items that never make it to review, your dashboards are missing the risk.

One clear signal that a team is sliding into illegibility is that they are moving quickly but struggle to explain the end-to-end critical path. Their code and designs look mature in a slide deck and in code reviews, yet still depend on a handful of people carrying the real model in their heads. The bill comes due at the next attrition event, or when an incident’s blast radius extends beyond what you can measure.

None of these signals breaks DORA. They mean the DORA contract is too weak for the system you are running.

How to make DORA more honest

Add a detailed scope note to every DORA review for a critical system. At a minimum, explicitly define the metrics, and keep the note in the same document as those definitions for the best context. Include your risk profile and tolerance, the dependency chain, known limitations, and the risks of out-of-band changes.
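
As a starting point, the scope note can be a small structured record kept beside the metric definitions. The Python sketch below shows one possible shape; the system name, fields, and values are illustrative, not a prescribed format.

    # One possible shape for a scope note on a single critical system.
    # Every field name and value here is illustrative; adapt it to your
    # own review format.
    scope_note = {
        "system": "payments-api",  # hypothetical system name
        "metric_definitions": {
            "deployment_frequency": "production deploys per week, main branch only",
            "change_failure_rate": "deploys needing rollback or hotfix within 48h",
            "mttr": "detection to confirmed customer recovery, not rollback time",
            "lead_time": "first commit to production deploy",
        },
        "risk_profile": "customer-funds path; regulator-visible",
        "risk_tolerance": "no unreviewed change on the authentication path",
        "dependency_chain": ["third-party IdP", "vendor payment gateway",
                             "AI triage layer"],
        "known_limitations": [
            "vendor console changes are not captured in deploy counts",
            "agent-applied config changes bypass code review",
        ],
        "out_of_band_risks": "IdP rule changes can alter who clears auth "
                             "without any measured deployment",
    }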

These details are the beginning of bringing DORA metrics back into alignment and treating them as a scoped instrument rather than a complete story. As a tool, they are useful. Relying on them alone leaves you blind.

To promote rigorous management, it is imperative to establish clear action steps for evaluating changes in metrics. For instance, if deployment frequency decreases, examine dependency load and the handoff chain before instructing teams to increase their speed. 

Similarly, if the change failure rate rises, conduct a thorough review of the dependency chain to identify where loss of change capture is occurring before attributing the issue to the most recent deployment. Consider what aspects are outside your review process and may be contributing to an expanded blast radius.

If MTTR looks great but the team still can’t explain the failure, your systems are illegible. Ask how many people can walk you through the system end-to-end, in plain language, in under five minutes. This method ensures that any movement in metrics prompts thoughtful, systemic inquiry rather than superficial corrective measures.
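
One rough way to operationalize this is a lookup from metric movement to investigation prompt rather than to a speed directive. The sketch below encodes the rules from the preceding paragraphs; the function name, signal names, and prompt wording are assumptions for illustration.

    def triage(metric, direction, team_can_explain_failure=True):
        """Map a metric movement to an investigation prompt, not a speed order."""
        if metric == "deployment_frequency" and direction == "down":
            return ("Examine dependency load and the handoff chain "
                    "before telling teams to speed up.")
        if metric == "change_failure_rate" and direction == "up":
            return ("Walk the dependency chain for lost change capture "
                    "before blaming the latest deployment.")
        if metric == "mttr" and direction == "down" and not team_can_explain_failure:
            return ("Fast recovery without comprehension: treat the system as "
                    "illegible and run the five-minute end-to-end walkthrough.")
        return "Log the movement in the scope note and keep watching."

    print(triage("change_failure_rate", "up"))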

Digging here isn’t a shamefest. It’s critical to understand whether knowledge is shared or concentrated. If the answer is two people or fewer, that is not a knowledge concentration problem. It is a succession crisis. These checks work until they become routine; when the scope note stops changing between quarters, that is the signal to pressure-test it again.


Manage pushback

Someone in the room will push back: everything looks good and incidents are down, so why add a process and take time away from development?

DORA measures delivery, not comprehension. Without the additional process, all we know is that the pipe is flowing. We don’t know whether anyone can explain what is in the pipe or what happens when it breaks somewhere that the dashboard doesn’t see.

For critical systems, require every DORA review to include at least one recent incident narrative and a brief note on what changed afterward in ownership, operating practices, or risk posture. Apply the same discipline to team performance: DORA can inform conversations about delivery trends, but it should not, on its own, be used to judge team quality, support promotion decisions, or serve as a substitute for understanding the system.
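
One way to enforce that pairing is to make the review record itself carry the narrative fields. Below is a minimal sketch assuming a simple dataclass shape; the fields and example values are hypothetical.

    from dataclasses import dataclass

    @dataclass
    class DoraReview:
        system: str
        period: str
        metrics_summary: str        # the four numbers, read against the scope note
        incident_narrative: str     # at least one recent incident, in plain language
        post_incident_changes: str  # what shifted in ownership, practice, or risk posture

    review = DoraReview(
        system="payments-api",
        period="2026-Q3",
        metrics_summary="deploys/week up 20%; CFR flat; MTTR 12 min",
        incident_narrative="IdP rule change silently altered who cleared auth; "
                           "rollback restored service before anyone could say why.",
        post_incident_changes="Vendor change feed added to review scope; "
                              "a named owner now tracks IdP rules.",
    )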

Pick one critical system and run this tighter review pattern for a quarter. After that quarter, the real question is whether leaders are making better decisions about ownership, risk, and system understanding, not just showing better charts.

DORA metrics are not useless, and the goal should not be to replace them. We need to stop making them carry work they cannot support. Now more than ever, with AI-heavy systems, code delivery and system understanding are not equal. Leaders who understand that reality early can change the contract before the next ugly incident does it for them.