Lessons from 100 P0 incidents

Hard-won patterns for designing systems and leading teams to detect failures sooner, respond faster, and limit the blast radius.

June 03, 2026

Practical lessons from handling 100 P0 production incidents over 20 years, focusing on system design, incident response, observability, and leadership decisions that reduce impact and recovery time.

Over the past 20 years, I have been directly involved in handling more than 100 P0 production incidents across large-scale, distributed systems. These incidents included full outages, severe performance degradation, and data integrity failures, often under significant time pressure and business impact.

This talk shares the most important lessons learned from being on the front line of those incidents. Rather than walking through individual war stories, it focuses on the recurring technical and organizational patterns that consistently shaped outcomes, both good and bad.

I will explain how early system design decisions, alerting quality, ownership boundaries, and operational practices influence incident severity long before anything breaks. The talk also examines what actually helps teams stay effective during high-stress incidents, and what tends to increase confusion, delay recovery, or lead to burnout.

Drawing on real experience as a hands-on engineer and technical leader, this session offers practical guidance on improving incident response, building more resilient systems, and leading teams through P0 situations with clarity and confidence. The goal is not to eliminate incidents entirely, but to reduce their impact and help teams respond better when failures inevitably occur.

Key takeaways

  1. Recognize recurring patterns behind high-severity production incidents.
  2. Design systems and alerts that surface problems earlier and more clearly.
  3. Improve incident response by reducing cognitive load during P0 events.
  4. Run incident reviews that lead to meaningful system and process changes.
  5. Lead teams through production failures without blame or burnout.