Key takeaways:
- Your metrics are lying to you. Deployment frequency, cycle time, PR volume – AI inflates all of them without improving engineering quality.
- Goodhart’s Law is now unavoidable: gaming metrics used to require effort. With AI, it’s a side effect of normal tool use.
- Three metrics still hold up! If your dashboard isn’t built around these, rebuild it.
Most software engineering metrics were built on the assumption that human effort and output move roughly in proportion to each other. AI has upended that assumption.
For years, teams have relied on the same set of proxies – deployment frequency, cycle time, lines of code, pull request (PR) volume – to answer a deceptively simple question: is engineering actually working?
These metrics were never perfect. They were approximations, built on the assumption that human effort and human output moved roughly in proportion to each other. When that assumption held, the proxies were good enough. Then came the explosion of AI-coding tools.
Now that a developer can generate a working PR in minutes, deployment frequency stops signalling team health and starts signalling tool adoption.
If an agent produces 400 lines of syntactically correct code that addresses the symptom rather than the root cause, cycle time compresses while technical debt accumulates.
When test coverage is written by the same model that wrote the code, the number means something different than it used to.
These metrics always treated effort as a stand-in for value, but AI has made effort nearly free. What’s left is the harder question your dashboards were never designed to answer: are we building the right things, in the right way, with the right outcomes in mind?
Let’s look at eight metrics that are breaking down in the AI-coding era, why, and what engineering leaders should be thinking about instead. Some are part of common frameworks, others are standalone measurements.
DORA metrics
We’ll start with three DevOps Research and Assessment (DORA) metrics. DORA metrics were built on the assumption that delivery speed and reliability signal good engineering practices underneath.
1. Deployment frequency
Deployment frequency measures how often a team successfully deploys code to production. It has traditionally been used as a signal of engineering agility and healthy delivery practices: frequent deploys signal small batch sizes, good test coverage, and fast feedback loops. A high number of deployments meant the underlying practices were working.
AI breaks that logic. A single engineer with an agent can quickly ship multiple PRs without any of those practices having improved. Deployment frequency goes up, but what it’s measuring has changed from engineering maturity to tool adoption. The two are not the same.
“Output is now nearly free, so volume measures the tool and not the engineer,” James Stanier, CTO at Nordhealth, tells LeadDev. “The scarce resource has moved from writing code to reviewing and shipping it.”
2. Lead time for changes
Lead time for changes measures the time from commit to production, with shorter lead times traditionally meaning fewer bottlenecks and tighter feedback loops. It is often referred to more generally as cycle time.
AI compresses this so dramatically it stops being informative. When an agent produces a working implementation in minutes, lead time collapses. What’s worse, artificially short lead times mask the real bottleneck that has emerged: the time it takes a human to understand and verify code they didn’t write.
“What I have also seen is that a faster cycle time meant the code was shipped with AI all along and little to no manual review, which increased the defect escape rate,” Pratik Mistry, EVP of technology consulting at Radixweb, tells LeadDev.
“This is what makes cycle time increasingly less useful as a standalone metric in the AI era. Unless you dive deeper and look at the supporting metrics as well, it can tell a very incomplete story.”
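One way to look at the supporting metrics alongside cycle time is to split each change's lead time into its review portion and everything else. The sketch below is illustrative only: the record fields (`commit`, `review_start`, `approved`, `deployed`) and the sample timestamps are assumptions, not data from any team or tool mentioned here.

```python
from datetime import datetime
from statistics import median

# Hypothetical change records; field names and timestamps are invented for illustration.
changes = [
    {"commit": "2026-01-05T09:00", "review_start": "2026-01-05T09:10",
     "approved": "2026-01-05T09:12", "deployed": "2026-01-05T09:40"},
    {"commit": "2026-01-06T10:00", "review_start": "2026-01-06T10:30",
     "approved": "2026-01-06T12:30", "deployed": "2026-01-06T13:00"},
    {"commit": "2026-01-07T14:00", "review_start": "2026-01-07T14:05",
     "approved": "2026-01-07T14:06", "deployed": "2026-01-07T14:30"},
]


def hours_between(start: str, end: str) -> float:
    """Elapsed hours between two ISO 8601 timestamps."""
    delta = datetime.fromisoformat(end) - datetime.fromisoformat(start)
    return delta.total_seconds() / 3600


def lead_time_breakdown(records: list[dict]) -> dict[str, float]:
    """Median commit-to-deploy time next to median time spent in review."""
    lead = [hours_between(r["commit"], r["deployed"]) for r in records]
    review = [hours_between(r["review_start"], r["approved"]) for r in records]
    return {"median_lead_time_h": median(lead), "median_review_h": median(review)}


breakdown = lead_time_breakdown(changes)
print(breakdown)
```

A short median lead time paired with a median review time of a couple of minutes is the pattern worth a closer look: the change moved fast, but very little of that time was a human reading it.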
3. Change failure rate
Change failure rate measures the percentage of deployments that cause a production failure. It’s used as a signal of code quality and release robustness. However, AI makes it harder to pinpoint exactly what went wrong and why.
When AI writes the code and a human approves it without fully understanding it, accountability breaks down. Continuous integration (CI) passes, the change ships, and then something fails. Change failure rate tells you a failure happened, but it can no longer tell you why: it is unclear whether the fault lies with the agent that generated the code, the reviewer who approved it, or the process that allowed it through.
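For reference, the arithmetic behind change failure rate is simple, which is part of why it says so little about cause. A minimal sketch, assuming a hypothetical list of deployment records with a `caused_failure` flag (both invented for this example):

```python
def change_failure_rate(deployments: list[dict]) -> float:
    """Fraction of deployments flagged as having caused a production failure."""
    if not deployments:
        return 0.0
    failed = sum(1 for d in deployments if d["caused_failure"])
    return failed / len(deployments)


# Hypothetical sample: 20 deployments, 3 of which caused a failure.
deployments = [{"id": i, "caused_failure": i in {3, 11, 17}} for i in range(20)]

rate = change_failure_rate(deployments)
print(f"Change failure rate: {rate:.0%}")
```

Nothing in the calculation records who or what introduced the fault. Any attribution to agent, reviewer, or process would have to come from data the metric itself does not carry.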
The SPACE framework
Now let’s cast an eye over three metrics from the SPACE framework. SPACE was designed to give engineering leaders a more human picture of productivity, one that went beyond activity and velocity to include how engineers actually experience their work. That breadth is also what makes it vulnerable to AI distortion.
4. Satisfaction
Satisfaction measures whether engineers find their work meaningful and rewarding, typically relying on survey-based self-reporting. The problem is that scores may be rising for the wrong reasons. Engineers feel more productive with AI thanks to less boilerplate and faster starts, but relief isn’t the same as meaningful work.
If the most cognitively engaging parts of the job are being absorbed by agents, satisfaction scores may mask a hollowing out of the work itself. You can feel relieved and disengaged at the same time.
5. Activity
Activity metrics such as commits, PR volume, code churn, and tickets closed were always a crude proxy for productivity. AI makes them meaningless.
A single engineer with an agent can generate more commits in a day than a team used to produce in a sprint. Activity metrics were imperfect before AI. Now they are actively misleading.
“I have a friend who works on their company’s docs site,” James Socol, senior staff engineer, tells LeadDev. “When a lot of people started making AI-created changes, the rate of change went way up. Then people complained that the really good docs had gotten much harder to navigate. A perfect example of velocity not correlating with value.”
6. Efficiency and flow
Efficiency and flow measures how freely engineers can work without interruption or friction. AI creates a convincing imitation of flow. When work moves faster and blockers disappear, the sense of momentum is real. Yet underneath it, a different kind of debt accumulates: a debt of expertise.
Engineers are making less contact with the hard parts of the system – the constraints, tradeoffs, and architectural decisions that build expertise over time. You can ship faster while understanding your system less deeply, and the metric will not tell you that is happening.
Now for some other metrics, starting with a famous one.
7. Lines of code
Lines of code measures raw output volume in terms of how much code was written. It’s a metric that has long been widely discredited because it incentivizes the wrong behaviors: verbose code scores higher than elegant code, and deleting bad code (often the most valuable thing an engineer can do) registers as negative productivity.
AI makes it even harder to justify. An engineer with an agent can generate thousands of lines in minutes, severing any remaining correlation between line count and engineering value. Worse, the most valuable work in the AI era, such as refactoring, simplifying, and deleting bad AI-generated code, registers as zero or negative productivity.
“We’ve observed AI-generated code to be +20% larger on average than human-generated code, and given how fast and easily AI-coding tools and agents can generate code for many purposes, it’s essentially an input metric, correlating highly with token spend,” says Nicholas Arcolano, head of AI and research at Jellyfish.
8. DevEx/DXI frameworks
Developer Experience (DevEx) and Developer Experience Index (DXI) measure the quality of the environment engineers work in, not just what they produce. DevEx looks at three dimensions: how quickly engineers get feedback on their work, how much mental effort is required to get things done, and how often they can work without interruption. DXI turns those dimensions into a scored index, giving leaders a quantitative way to track whether the developer experience is improving over time.
AI has decoupled many of these experiences from actual output quality. Engineers might feel more productive because friction has been removed. This shows up in survey scores, but an engineer can feel highly productive while spending most of their time steering and correcting AI output rather than producing work they fully own. Perceived productivity scores may be rising at exactly the moment output quality is becoming hardest to assess.
Goodhart’s Law has never been so relevant
Goodhart’s Law is the principle that when a measure becomes a target, it ceases to be a good measure. Once a metric is used as a goal for evaluation or incentives, people will manipulate the system to hit it, rendering the metric useless.
Goodhart’s Law was already a chronic problem in software engineering. AI has made it inescapable. In the pre-AI era, gaming a metric required deliberate effort. Now it’s a side effect of normal tool use.
Tokenmaxxing is a prime example. Tokenmaxxing is the practice of deliberately maximizing AI token consumption to game productivity metrics, rather than to produce better software. It’s Goodhart’s Law applied to AI usage. Once token spend became a target in some organizations, engineers optimized for the metric rather than the outcome.
Every metric that tracks volume, speed, or frequency can now be inflated effortlessly by an agent without any corresponding improvement in what it was designed to measure. The inflation isn’t obviously fraudulent. The code exists, the tests run, and the reviews happen. But the judgment those signals were designed to indicate may be entirely absent.
Three engineering metrics surviving AI
The metrics that survive the AI era will share one characteristic: they measure what happened as a result of engineering work, not how much engineering work happened.
1. Time to recover
There’s one DORA metric standing firm in the face of AI. Time to recover (also known as Mean Time to Recovery or MTTR) measures how quickly a team recovers from a production incident or outage, which is a problem that AI doesn’t fundamentally change.
If anything, AI-assisted debugging may modestly improve it over time. For now it remains one of the more reliable signals of engineering health in the DORA framework.
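As a rough illustration of the calculation, time to recover is the average gap between when an incident starts and when service is restored. The incident timestamps below are invented, and using the mean rather than a median or percentile is a choice made for this sketch rather than a prescribed definition.

```python
from datetime import datetime
from statistics import mean

# Hypothetical incidents as (detected, resolved) ISO 8601 timestamps.
incidents = [
    ("2026-02-01T10:00", "2026-02-01T10:30"),
    ("2026-02-10T22:15", "2026-02-11T00:15"),
    ("2026-03-03T08:00", "2026-03-03T08:45"),
]


def mean_time_to_recover_minutes(records: list[tuple[str, str]]) -> float:
    """Average minutes from detection to resolution across incidents."""
    durations = [
        (datetime.fromisoformat(end) - datetime.fromisoformat(start)).total_seconds() / 60
        for start, end in records
    ]
    return mean(durations)


mttr = mean_time_to_recover_minutes(incidents)
print(f"MTTR: {mttr:.0f} minutes")
```

The inputs are production events, not code volume, which is why an agent generating more PRs does not move the number on its own.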
2. Business and customer outcome metrics
The metric AI cannot fake is whether the work delivered value. Did the feature get used? Did the change reduce error rates? Did the deployment improve customer outcomes? These are harder to measure and slower to move, but they are the closest thing to a direct measure of engineering value rather than a proxy for it.
As AI makes activity metrics increasingly meaningless, outcome-based measurement becomes more than just better practice – it becomes the only reliable approach.
3. Escaped defect rate
Escaped defect rate gauges the proportion of bugs that reach production having passed your verification layers. It measures outcome rather than activity, and AI cannot inflate it without consequence.
As code volume increases and human review becomes shallower, the rate at which defects slip through is one of the clearest signals of whether your verification systems are keeping pace with your generation speed.
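A minimal sketch of the calculation follows. It assumes the rate is defined as defects found in production divided by all defects found (before and after release); teams define the denominator differently, and the quarterly counts here are invented for illustration.

```python
def escaped_defect_rate(caught_before_release: int, found_in_production: int) -> float:
    """Share of all known defects that were first found in production."""
    total = caught_before_release + found_in_production
    if total == 0:
        return 0.0
    return found_in_production / total


# Hypothetical counts per quarter: (caught before release, found in production).
defects_by_quarter = {"Q1": (45, 5), "Q2": (60, 15)}

rates = {
    quarter: escaped_defect_rate(caught, escaped)
    for quarter, (caught, escaped) in defects_by_quarter.items()
}
for quarter, rate in rates.items():
    print(f"{quarter}: {rate:.0%} of defects escaped")
```

In this made-up data more defects are caught in Q2 than in Q1, yet the escaped share doubles. That is the kind of signal that suggests verification is not keeping pace with generation.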

Rethinking software engineering metrics in 2026
Engineering metrics built for human-paced development are breaking down. AI has severed the link between effort and output, inflating deployment frequency and PR volume and compressing cycle time without improving quality or outcomes.
“In the rush to quantify and optimize every aspect of software development, our industry forgot that metrics were never supposed to measure productivity, individual performance, or compare teams. They were always meant as a feedback mechanism,” says Ankit Jain, CEO of Aviator and founder of The Hangar community.
If Jain were tasked with designing delivery or productivity metrics from scratch at an engineering organization, here’s how he’d approach it:
- Gather requirements: “Don’t dive in and measure just to measure. Every engineering work starts with gathering requirements,” he says. In 2026, engineering leaders also need answers to AI-era questions: how many engineers are using AI tools, what they are using them for and how, how to scale AI adoption, how much risk there is of shipping AI slop to production, whether rework is rising, and whether code review queues are negating the productivity gains of faster code generation.
- Talk to management: metrics don’t live in a vacuum. “Management buy-in and support are essential; to achieve your goals, you may need a budget for tooling, infrastructure, or hiring,” Jain says.
- Talk to engineering: this step is often skipped precisely because metrics are not viewed as feedback mechanisms for teams to improve. “Talk to engineering and ask them how they would tackle the problem. Don’t walk in with a top-down directive,” Jain adds.
- Pair metrics with guardrails: pick guardrail metrics that don’t depend on a human catching everything.
- Drop or deprioritize outdated metrics: again, metrics are a feedback mechanism. “Once you’ve achieved the desired adoption level, you can shift that metric in the background and just monitor if it falls below a certain level.”
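The idea of pairing metrics with guardrails can be sketched as a simple check: flag any pair where the speed metric improved while its guardrail got worse. Everything here is an assumption made for illustration, including the `MetricPair` structure, the example pairings, and the 5% tolerance. The list above does not name specific guardrail metrics or thresholds.

```python
from dataclasses import dataclass


@dataclass
class MetricPair:
    name: str
    speed_change_pct: float      # positive means the speed metric improved
    guardrail_change_pct: float  # positive means the guardrail got worse


def needs_attention(pair: MetricPair, tolerance_pct: float = 5.0) -> bool:
    """True when speed improved while the guardrail degraded beyond the tolerance."""
    return pair.speed_change_pct > 0 and pair.guardrail_change_pct > tolerance_pct


# Hypothetical quarter-over-quarter changes.
pairs = [
    MetricPair("cycle time vs. escaped defect rate", 40.0, 25.0),
    MetricPair("deployment frequency vs. time to recover", 30.0, 2.0),
]

flagged = [p.name for p in pairs if needs_attention(p)]
print(flagged)
```

The check runs on recorded outcomes, so it does not rely on a reviewer noticing the problem in the moment.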