AI coding made us faster. Why did incidents increase?

Faster code. Slower, harder incidents.
May 19, 2026

Estimated reading time: 7 minutes

Key takeaways:

  • AI amplifies whatever your delivery process already is. Strong practices get faster. Weak ones get paged at midnight.
  • The self-grading loop is your biggest hidden risk: when the model writes code and tests, it tests its own assumptions.
  • Observability is a release gate, not an afterthought.

Across engineering organizations, a pattern has become consistent. After rolling out an AI-coding assistant, velocity metrics improve quickly: pull requests (PRs) get bigger, cycle times shorten, and sprint records fall. Then, a few months in, the on-call rotation gets brutal. More tooling, faster shipping, and worse reliability. These are connected, not coincidental.

The DevOps Research and Assessment (DORA) 2024 report confirmed the pattern across the industry. Teams with significantly higher AI adoption also showed a higher change failure rate, meaning more deployments requiring hotfixes or causing production failures.

However, that finding cuts both ways. One pattern worth noting: teams that had already adopted contract testing and canary deployments before introducing AI tended to see their change failure rate fall rather than rise, because the AI-accelerated changes were already well-constrained.

Contract testing verifies that services interact correctly at their boundaries, while canary deployments release changes to a small percentage of traffic before rolling out fully. The tools themselves are not the problem. The delivery system around them is.

Three failure patterns explain most of the gap between those two outcomes. These are not new problems. They are old ones running faster.

3 failure patterns

1. Polished code fools reviewers

AI-generated code looks right. It follows naming conventions, respects linting rules, and reads like something a senior engineer would produce. PRs that would previously have prompted 20 are getting approved in 15. Often, reviewers pattern-match to familiar structure and skip the hard reasoning: side effects, edge cases, business logic that only makes sense if you understand the domain.

In practice, a model can produce a wrong implementation with the same fluency as a correct one. A reviewer under time pressure may not catch the difference.

2. When AI marks its own homework

When the same model writes the code and then writes the tests, the tests cover what the model produced, not what the requirements are. Coverage turns green. Edge cases nobody described remain untested. Worse, AI-generated tests frequently verify implementation details rather than observable behavior. Refactor the code later and the tests need rewriting too. The suite slows every future change while providing none of the safety it implied.
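The difference between testing implementation details and testing observable behavior can be shown with a minimal sketch. The `Cart` class and both tests are hypothetical and exist only for illustration.

```python
class Cart:
    """Hypothetical shopping cart used only to illustrate the two test styles."""

    def __init__(self) -> None:
        self._items: dict[str, int] = {}  # internal storage, free to change

    def add(self, sku: str, quantity: int) -> None:
        self._items[sku] = self._items.get(sku, 0) + quantity

    def total_quantity(self) -> int:
        return sum(self._items.values())


def test_implementation_detail() -> None:
    # Brittle: asserts HOW the cart stores data. Switching the internal
    # storage to a list of line items breaks this test even though
    # callers would see no difference.
    cart = Cart()
    cart.add("sku-a", 2)
    assert cart._items == {"sku-a": 2}


def test_observable_behavior() -> None:
    # Robust: asserts WHAT a caller can observe, so it survives refactors
    # and fails only when behavior really changes.
    cart = Cart()
    cart.add("sku-a", 2)
    cart.add("sku-a", 1)
    cart.add("sku-b", 1)
    assert cart.total_quantity() == 4
```

Both tests pass today. Only the second one keeps passing, and keeps meaning something, after the internals are rewritten.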

3. AI cannot see the whole system

Every model works on the code it is shown. It has no awareness of the broader system: the shared retry queue, the upstream producer that sends late-arriving events, the implicit reliability guarantee held together by a design decision from three years ago. A change that looks like a clean refactor can quietly remove something critical. Those failures do not appear in unit tests. They appear on the on-call rotation.

None of these patterns were introduced by AI. Overconfident review, shallow test coverage, and implicit system assumptions predate coding assistants by years. What AI does is amplify whatever is already there, for good and bad. The question is whether your delivery process is worth amplifying.

A quality-first operating model

The answer is not to slow AI adoption. It is to redesign the delivery process so that speed and reliability reinforce each other. Three principles make the biggest practical difference.

1. Write the spec before you write the prompt

Before prompting the model to write any code, document the expected behavior in plain English: what the code should do, what inputs it handles, and what happens when things go wrong. Two or three sentences are enough. The AI then writes tests against that spec, and the implementation to satisfy those tests, breaking the self-grading loop described above.
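As a rough illustration of that ordering, the sketch below starts from a three-sentence spec, derives the tests from the spec alone, and only then supplies an implementation. The function name `parse_amount_cents` and the spec text are assumptions made up for this example.

```python
from decimal import Decimal, InvalidOperation

# Step 1: the spec, written by a human before any prompt (hypothetical).
SPEC = """
parse_amount_cents converts a decimal string such as "12.50" into integer cents.
It accepts at most two decimal places.
It raises ValueError for negative, non-numeric, or over-precise input.
"""


# Step 2: tests derived from the spec, not from any implementation.
def test_against_spec() -> None:
    assert parse_amount_cents("12.50") == 1250
    assert parse_amount_cents("0") == 0
    for bad in ("-1.00", "abc", "1.005"):
        try:
            parse_amount_cents(bad)
        except ValueError:
            continue
        raise AssertionError(f"expected ValueError for {bad!r}")


# Step 3: the implementation, written last to satisfy the tests.
def parse_amount_cents(text: str) -> int:
    try:
        value = Decimal(text)
    except InvalidOperation as exc:
        raise ValueError(f"not a number: {text!r}") from exc
    if not value.is_finite() or value < 0:
        raise ValueError(f"amount must be a non-negative number: {text!r}")
    cents = value * 100
    if cents != cents.to_integral_value():
        raise ValueError(f"more than two decimal places: {text!r}")
    return int(cents)
```

The tests encode the failure cases the spec names, so an implementation that ignores them cannot turn the suite green.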

Many teams resist this as an extra step. Writing a formal spec used to mean tickets, acceptance criteria, and a refinement session. Writing two sentences before pasting a prompt takes four minutes. The cost of documenting intent has dropped far enough that engineers actually do it, and getting engineers to do it was the obstacle that made test-driven development (TDD) hard to adopt for 20 years.

Figure 1 (below) contrasts the two workflows: the old pattern where the model writes code and then tests against what it just produced, and the spec-first approach where intent precedes every line of generated code.

Left: the self-grading loop: AI writes code and tests, testing its own assumptions.
Right: spec-first flow: intent is documented before the model is invoked, breaking the loop.

2. Tier changes by risk and enforce contract tests

Before AI touches authentication, payment logic, or data model changes, require a human to write the spec and a second reviewer to validate the business logic explicitly. Use feature flags to decouple deployment from release.

Feature flags are toggles that let you ship code without activating it for users yet. For anything touching an external integration, require a contract test that validates actual application programming interface (API) behavior against the live endpoint, not a mock. 
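A feature flag can be as small as a named toggle checked at the decision point. The sketch below is a minimal in-memory version. The flag name and pipeline names are hypothetical, and a production system would more likely read flags from a flag service or configuration store.

```python
from dataclasses import dataclass, field


@dataclass
class FeatureFlags:
    """Minimal in-memory flag store (illustrative only)."""

    enabled: set[str] = field(default_factory=set)

    def is_enabled(self, name: str) -> bool:
        return name in self.enabled


def choose_disbursement_pipeline(flags: FeatureFlags) -> str:
    # The new code path is deployed but stays dark until the flag is on,
    # which is what decouples deployment from release.
    if flags.is_enabled("new-disbursement-pipeline"):  # hypothetical flag name
        return "new_pipeline"
    return "legacy_pipeline"


# Deployed, not released: every caller still gets the legacy path.
flags = FeatureFlags()
pipeline_before_release = choose_disbursement_pipeline(flags)

# Released: flipping the flag activates the new path without a redeploy.
flags.enabled.add("new-disbursement-pipeline")
pipeline_after_release = choose_disbursement_pipeline(flags)
```

Turning the flag off again is also the fastest rollback, because it needs no new deployment.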

Many teams also find mutation testing scores useful as a merge gate. Mutation testing deliberately introduces small faults into code to confirm that the tests actually catch them. If the tests still pass against the faulty code, they are testing implementation rather than behavior.
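The idea can be demonstrated by hand with one function, one deliberately injected fault, and two test suites. Everything here is hypothetical, and real mutation testing tools generate and run the mutants automatically.

```python
from typing import Callable

ShippingRule = Callable[[int], bool]


def is_free_shipping(total_cents: int) -> bool:
    """Original rule: orders of 5000 cents or more ship free."""
    return total_cents >= 5000


def mutant_is_free_shipping(total_cents: int) -> bool:
    """Mutant: the injected fault changes >= to >."""
    return total_cents > 5000


def weak_suite(rule: ShippingRule) -> bool:
    # Checks only values far from the boundary.
    return rule(10_000) is True and rule(100) is False


def strong_suite(rule: ShippingRule) -> bool:
    # Also pins the boundary the requirement actually defines.
    return weak_suite(rule) and rule(5000) is True and rule(4999) is False


def mutant_killed(suite: Callable[[ShippingRule], bool]) -> bool:
    # A suite "kills" the mutant if it passes on the original code
    # and fails on the mutated code.
    assert suite(is_free_shipping), "suite must pass on the original code"
    return not suite(mutant_is_free_shipping)


weak_result = mutant_killed(weak_suite)      # mutant survives
strong_result = mutant_killed(strong_suite)  # mutant is killed
```

The weak suite has full line coverage of the function and still lets the fault through, which is why a mutation score says more than a coverage percentage.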

3. Treat observability as a release gate

For medium and high-risk changes, define which metrics should be affected before deploying. Use canary rollout with automated rollback if error rate or latency crosses a threshold. Require a linked monitoring dashboard before any production-path PR can merge. If the author cannot point to a dashboard, the change is not done.
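The automated rollback decision can be sketched as a pure function comparing canary metrics with the baseline. The thresholds, field names, and minimum sample size below are illustrative assumptions, not recommended values.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WindowStats:
    """Metrics for one observation window (hypothetical shape)."""

    requests: int
    errors: int
    p95_latency_ms: float

    @property
    def error_rate(self) -> float:
        return self.errors / self.requests if self.requests else 0.0


def should_roll_back(
    baseline: WindowStats,
    canary: WindowStats,
    max_error_rate_increase: float = 0.01,  # assumed threshold
    max_latency_ratio: float = 1.5,         # assumed threshold
    min_requests: int = 100,                # assumed minimum sample
) -> bool:
    """Return True when the canary is measurably worse than the baseline."""
    if canary.requests < min_requests:
        return False  # not enough traffic yet; keep observing
    if canary.error_rate - baseline.error_rate > max_error_rate_increase:
        return True
    if canary.p95_latency_ms > baseline.p95_latency_ms * max_latency_ratio:
        return True
    return False


baseline = WindowStats(requests=10_000, errors=10, p95_latency_ms=100.0)
healthy = WindowStats(requests=500, errors=1, p95_latency_ms=120.0)
failing = WindowStats(requests=500, errors=25, p95_latency_ms=110.0)
```

With these numbers, `should_roll_back(baseline, healthy)` is false and `should_roll_back(baseline, failing)` is true. Deciding the thresholds before deploying is the point: it forces the author to name the metrics the change should and should not move.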

How one wrong field took down a transaction service at peak load

The following scenario illustrates a class of failure that recurs across engineering teams when AI-assisted development outpaces the controls around it.

The call came at 11.30 pm during a peak transaction window. A FinTech team’s transaction processing service had gone quiet in the worst possible way: not crashing, not throwing visible errors, just silently dropping outbound disbursements. Recipients were seeing no confirmation; funds were not arriving. This was the kind of night when silence from a financial service means something has gone badly wrong.

Earlier that week, an AI assistant had helped refactor the service. The model correctly identified that adding an idempotency key, which is a unique token sent with each request to prevent duplicate processing if the same request is retried, would improve reliability. It generated the change and placed the key in the request header. The downstream service provider only accepted it in the request body. Every unit test passed because the external API was mocked.
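The mismatch is easy to express as a contract check. The sketch below is a local stand-in that shows only the assertion a contract test would make. The field names and the `REQUIRED_BODY_FIELDS` contract are hypothetical, and a real contract test would send the request to the provider's actual endpoint rather than compare against a hand-written set.

```python
from dataclasses import dataclass, field


@dataclass
class OutboundRequest:
    headers: dict[str, str] = field(default_factory=dict)
    body: dict[str, object] = field(default_factory=dict)


# Hypothetical provider contract: the key must travel in the body.
REQUIRED_BODY_FIELDS = {"amount_cents", "recipient_id", "idempotency_key"}


def build_request_key_in_header(amount_cents: int, recipient_id: str, key: str) -> OutboundRequest:
    # Mirrors the refactor in the scenario: plausible, but not what the provider accepts.
    return OutboundRequest(
        headers={"Idempotency-Key": key},
        body={"amount_cents": amount_cents, "recipient_id": recipient_id},
    )


def build_request_key_in_body(amount_cents: int, recipient_id: str, key: str) -> OutboundRequest:
    return OutboundRequest(
        body={
            "amount_cents": amount_cents,
            "recipient_id": recipient_id,
            "idempotency_key": key,
        },
    )


def contract_violations(request: OutboundRequest) -> list[str]:
    """List every required body field the request fails to send."""
    missing = REQUIRED_BODY_FIELDS - set(request.body)
    return sorted(f"missing body field: {name}" for name in missing)


header_version = contract_violations(build_request_key_in_header(1250, "r-1", "k-1"))
body_version = contract_violations(build_request_key_in_body(1250, "r-1", "k-1"))
```

Here `header_version` reports the missing `idempotency_key` field and `body_version` is empty. A mocked unit test cannot surface this, because the mock accepts whatever the code sends.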

The engineer who traced it described a specific feeling: the code looked completely right. Clean, well-commented, the kind you would point to in a review as an example of doing it properly. The bug was not in the logic. It was in an assumption about a third-party API the model had never actually called.

Three gaps made it possible: the refactor had not been risk-tiered, no contract test had validated the integration against the live API, and no canary had monitored error rates during rollout. Any one of those controls would have caught it within minutes. None existed.

The model wrote correct code for the problem as it understood it. The problem was incompletely specified and nobody had built the system to check.

What changed, and what to do next

Across teams I have seen adopt these controls, incident frequency has typically dropped by roughly a third within two quarters. Recovery time (mean time to recovery, or MTTR) tends to improve more than raw incident count, because defining metrics before deployment builds the dashboards needed for fast diagnosis.
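Both metrics are simple to compute once deployments and incidents are recorded. The sketch below assumes hypothetical `Deployment` and `Incident` records, and how a team labels a deployment as failed is itself an assumption that needs an agreed definition.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Deployment:
    caused_failure: bool  # needed a hotfix or rollback, or caused an incident


@dataclass(frozen=True)
class Incident:
    started: datetime
    resolved: datetime


def change_failure_rate(deployments: list[Deployment]) -> float:
    """Share of deployments that caused a failure, between 0.0 and 1.0."""
    if not deployments:
        return 0.0
    return sum(d.caused_failure for d in deployments) / len(deployments)


def mean_time_to_recovery(incidents: list[Incident]) -> timedelta:
    """Average time from incident start to resolution."""
    if not incidents:
        return timedelta(0)
    total = sum((i.resolved - i.started for i in incidents), timedelta(0))
    return total / len(incidents)


# Made-up sample data for illustration only.
deployments = [Deployment(False), Deployment(True), Deployment(False), Deployment(False)]
incidents = [
    Incident(datetime(2026, 1, 5, 23, 30), datetime(2026, 1, 6, 0, 0)),   # 30 minutes
    Incident(datetime(2026, 2, 9, 14, 0), datetime(2026, 2, 9, 15, 30)),  # 90 minutes
]
cfr = change_failure_rate(deployments)
mttr = mean_time_to_recovery(incidents)
```

For this sample data the change failure rate is 0.25 and the MTTR is one hour.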

On-call fatigue has also eased, though that is harder to measure than incident counts. Engineers tend to dread release days less. The tooling has not changed. The process has.

AI will amplify whatever your delivery system already is. In practice, and consistent with the DORA amplifier finding, teams that had strong practices before adopting AI tended to get faster. Teams that did not got a wake-up call late at night, at precisely the moment their system could least afford it. The good news is the gaps are fixable and the fixes do not require slowing down.

3 things to start this week

  1. Adopt spec-first prompting for critical changes. Require two or three sentences of intended behavior before any AI-assisted PR is opened on a production-path service.
  2. Classify AI-generated PRs by risk. Changes touching payment, authentication, or data models require human business-logic review, a contract test against the live API, and a documented rollback path.
  3. Treat observability as a release gate. No production-path merge without a linked monitoring dashboard. Measure success with change failure rate and MTTR – both should fall within two quarters.
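The risk classification in the second item can be automated as a first pass over the files a PR touches. The directory names and the way the controls are combined below are assumptions for illustration, and a path-based rule cannot replace human judgment about what a change really affects.

```python
from pathlib import PurePosixPath

# Hypothetical directory names that mark high-risk areas of a repository.
HIGH_RISK_DIRS = {"auth", "payments", "migrations"}

BASE_CONTROLS = [
    "spec of intended behavior",
    "linked monitoring dashboard",
]
HIGH_RISK_CONTROLS = [
    "human business-logic review",
    "contract test against the live API",
    "documented rollback path",
]


def risk_tier(changed_paths: list[str]) -> str:
    """Return "high" if any changed file sits under a high-risk directory."""
    for path in changed_paths:
        directories = set(PurePosixPath(path).parts[:-1])  # drop the file name
        if directories & HIGH_RISK_DIRS:
            return "high"
    return "standard"


def required_controls(changed_paths: list[str]) -> list[str]:
    """List the controls a PR must satisfy before it can merge."""
    controls = list(BASE_CONTROLS)
    if risk_tier(changed_paths) == "high":
        controls += HIGH_RISK_CONTROLS
    return controls


docs_pr = required_controls(["docs/readme.md"])
payments_pr = required_controls(["services/payments/client.py", "docs/readme.md"])
```

A check like this could run in continuous integration and block the merge until each listed control is attached to the PR.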