How to run a great incident post-mortem

Implementing a blameless culture
March 29, 2021

In building software, continuous learning and blameless culture are mainstays of high-performing teams.

From Accelerate, to the Agile Manifesto, to Google’s research, psychological safety and learning from mistakes are key attributes of great teams. But what about when things go seriously wrong? How do strong teams come through these failures in one piece? The answer lies within ‘the blameless post-mortem’: a process nicely defined by Google’s SRE culture guide as ‘focused on identifying the contributing causes of the incident without indicting any individual or team for bad or inappropriate behavior’.

These contributing causes that we seek to identify are ones that stem from the system in which we work, and not from any individual actor. ‘The system’ encompasses everything from the cultural climate, to the tooling and interfaces, to the technology, to the business ecosystem. It’s everything that, when taken into account, causes a person (or team) to make a certain decision.

An effective, blameless incident post-mortem meeting is a powerful tool in our box as engineering managers because it allows us to uncover the systemic reasons for failures, and thus refine our software, processes, tools, and culture so that the same failure is less likely to happen again. Over time, this process helps leaders to design systems in which the rational, expedient choice at the time is also the safe or ‘correct’ choice in hindsight.

What makes a post-mortem successful?

A productive post-mortem is one that results in a change to the system or environment in which people work, a change that enables and supports the people within it to avoid making the same ‘mistake’ again.

I put ‘mistake’ in quotation marks because the decision did not look like a mistake to the team at the time it was made. The most important thing to understand about post-mortems is that the ‘mistakes’ are in fact the product of rational, well-intentioned people, making the best possible decision given the system they are in.

This is critical – because if you blame human error, e.g. ‘Alice made a mistake here and Bob there, and they need to do better,’ then others will go on to make that exact same error the next time the same circumstances arise.

Sidney Dekker refers to this as the ‘bad apple’ theory: if you remove the person or people causing the problem then you’re left with a perfect system. This is just not true. In reality, your system will continue to lure your engineers into making the same rational judgment calls that only in hindsight are obviously mistakes. In short, unless you change your system, you are doomed to repeat this same incident in the future. For a series-length exposé on the effect that a fearful and blaming post-mortem culture can have, watch Chernobyl.

A post-mortem is productive if you manage to identify a systemic reason for failure, and can adjust your system to reduce the likelihood of it happening again.

How to run a productive post-mortem meeting

Be prompt

As soon as an incident is over, people start forgetting what happened. Human brains tend to obscure the details of painful events (think of folks repeatedly getting tattoos or going through childbirth), and what this means for you, engineering manager, is that you need to post-mortem as soon as possible. Every day that goes by means you lose critical information, so aim to post-mortem within 48 hours if you can. That said, better late than never; even a fuzzy debrief is better than denial.

Keep to a small group

With incidents, the more people there are in the meeting, the harder it is to have a productive discussion. Resist the pressure to add in every stakeholder affected by the incident and eager to see corrective action taken. Ideally, your group should be under five people, and only those directly involved.

Make it safe

A feeling of true blamelessness is really important to make sure you learn what actually went wrong in the system. You don’t want people hiding mistakes and blaming each other, or excessively blaming and hectoring themselves. You need them to explain clearly why their actions made sense to them at the time, and for them to do this they must trust you. Every time I run an incident post-mortem, I like to state the goals and explain how human error is a starting point – for example, by saying something like, ‘You are all competent, rational people and I know you did your best, and that you did what made sense to you at the time. I’m interested in the conditions around the incident and how to change the conditions you were working in; I’m not interested in whose fault anything was’.

I also find that some humor goes a long way in these tense situations, as well as a genuine enthusiasm for finding, together, what went wrong in the system. When the leader relishes the hunt for the quirks and risks in the system, it helps team members embrace the hunt too and let go of some of the natural emotions of guilt and fear that accompany any incident.

Start with facts

Start off the post-mortem with a factual statement of exactly what happened, as if it’s a printout from a video feed. There should be absolutely no emotion, judgment, or explanation attached. This is to state, as clearly as possible, what the problem and impact was. At the end of the meeting, when you collect action items, you can then see whether these actions would make that statement of fact less likely to occur. Here’s the exact factual statement from the most recent post-mortem I facilitated: On Friday 19th Feb at 10am GMT, we ran the rollout script for the Shared Channels feature to 62,000 of our most active organizations. 8,000 customers experienced issues with a channel connection. All customers who logged in over the weekend saw an in-app banner warning of recent issues that might require them to reconnect a channel.

Then tell stories

Initially, I’d debrief incidents by stating what happened, and then go straight into what I called the ‘Five Whys’ approach. This involved asking ‘Why?’ for each layer of the problem. For example, ‘Why did we roll this out on a Friday?’, or ‘Why did we roll this out to so many active users?’ Then you ask the next ‘Why’, based on the answer.

For example, if we realize ‘We rolled out on Friday because we were rushing,’ you might ask, ‘Why were you rushing?’ If the answer was, ‘Because we misunderstood the PM’s request to roll out to 4% as “4% of all weekly actives”, and not “4% of new signups”,’ we might then ask, ‘Why did we refer to % targets and not absolute numbers?’ As you can already see, which ‘Whys’ you start with dramatically changes the conversation (is the problem about deadline pressures? About rollout numbers? Something else?). As a result, I’ve since learned that going straight into asking ‘Why?’ isn’t the most helpful because it very quickly takes you down a narrow path to one of a variety of conclusions, ignoring the other avenues. In this case, all of those factors and more were relevant.

It’s a limitation of root cause analysis methods that they assume, for any incident, you can get down to the one, single ‘root cause’ that made everything go wrong – because this is often not the case. The ‘Five Whys’ method also tends to result in an oversimplified and linear cause-and-effect model, ignoring the reality of multiple overlapping factors that contribute to a failing system. You have complex humans navigating complex systems, and a series of events is just as likely to be ‘the cause’ as one major event.

Instead of jumping down the ‘Why’ path in search of a potentially elusive root cause, have each person involved tell their story, in as much detail as they can. This takes time, but this is the important part. As the engineering manager, your job is to highlight and probe for any interesting bits of insight that explain why the system enticed this rational, well-meaning person into the decisions they made.

Human error is the starting point

Instead of blaming human error, a good post-mortem tells the ‘Second Story’. The First Story is the obvious one of human error: someone made a mistake. The Second Story is the story of systemic forces, the one that digs into why the person made a mistake in the first place.

First Story: Human error is the cause of failure.
Second Story: Human error is the result of systemic vulnerabilities deeper inside the organization.

First Story: Defining what people should have done instead is a complete way to describe failure.
Second Story: Saying what people should have done doesn’t explain why it made sense for them to do what they did.

First Story: Failures can be prevented if people are told to be more careful or punished for mistakes.
Second Story: Only by constantly seeking out their vulnerabilities can organizations enhance safety.

From ‘Behind Human Error’.

Telling the Second Story is what allows air travel to be so safe, and what helps nuclear power plants to avoid melting down (see Nickolas Means’s excellent LeadDev talk on this exact situation). It turns out that when you blame pilots and nuclear power plant operators for ‘not following protocol’ – when you blame human error – you do not get more compliant humans who make fewer mistakes. You get airplanes crashing and nuclear reactors melting down. This dramatic field, known as Human Factors Engineering, always seeks to understand the systemic root of a problem, and uses human error as the starting point for an investigation, never the conclusion.

I like to ask, ‘Why did we, rational and well-intentioned humans, do X, when in hindsight, X was a mistake?’. In doing so, I’m baking in that human error isn’t the answer. Instead, we are looking for why the system made X seem like the right approach at the time. We’re not looking for solutions like, ‘Fire the person who did X’, or ‘Train people into never doing X’. We’re looking for solutions that change the system in which X looked like a great idea at the time.

Ask why, last

Once you have the stories and are using your human errors as the jumping-off point, you can then dig into why certain key decisions along the way seemed rational. Here, it’s critical that human error is your starting point. As the facilitator, you need to deeply believe that this behavior was the most rational thing possible under the circumstances and that you would have done the same in their shoes.

Look for an answer that points to the system, not to the humans. Then go on to ask your next ‘Why’. The next ‘Why’ can either relate to the same thread (Why do we do this all the time?) or to another part of the incident story. Remember that a single root cause often doesn’t exist, but you can and should learn why your system failed your people so that you can fix it.

Assign action items, write up, and share

Make sure that the post-mortem results in action items related to improving the system, and that these are assigned to individuals to be completed. Ideally, the actions are simple, iterative improvements, such as ‘add a checklist for launch’ or ‘add some automation to reduce an error-prone manual process’.
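To make the second kind of action item concrete, here is a minimal sketch of the sort of guard rail that ‘add some automation to reduce an error-prone manual process’ might produce, loosely inspired by the rollout mix-up described earlier. It is illustrative only: the function names, the cap, and the population figures are hypothetical, not taken from any real rollout script.

```python
# Hypothetical sketch of a rollout guard rail: convert a percentage target
# into an absolute batch size and refuse anything over an agreed cap, so a
# misread "4%" can't silently become tens of thousands of organizations.
# All names and numbers here are illustrative, not from the incident above.

def planned_batch_size(percent: float, population: int) -> int:
    """Translate a percentage target into an absolute number of organizations."""
    return int(population * percent / 100)


def confirm_rollout(percent: float, population: int, max_batch: int = 2_000) -> int:
    """Return the batch size, or raise if it exceeds the agreed cap."""
    batch = planned_batch_size(percent, population)
    if batch > max_batch:
        raise ValueError(
            f"{percent}% of {population:,} organizations is {batch:,}, "
            f"which exceeds the cap of {max_batch:,}. "
            "Confirm the intended target population before rolling out."
        )
    return batch


if __name__ == "__main__":
    # "4%" means very different things depending on which population it refers to.
    new_signups = 50_000          # illustrative figure
    weekly_actives = 1_500_000    # illustrative figure

    print(confirm_rollout(4, new_signups))   # 2000: within the cap, proceeds
    try:
        confirm_rollout(4, weekly_actives)   # 60,000: far over the cap
    except ValueError as err:
        print(err)                           # forces a human check instead
```

A checklist item can do the same job without any code; the point is that the action item changes the system (here, the rollout tooling) rather than asking people to be more careful next time.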

It’s important to then write up the post-mortem and share the findings and action items so that the organization more broadly can learn. Doing this well will also make it easier to keep group sizes small next time.

But what about accountability?

Inevitably, someone reading this or listening to you set up your blameless post-mortem will ask, ‘But what about accountability?’. If we are truly blameless, does that mean there is no such thing as a mistake, and that we are courting failure? To which the best response I’ve heard is that telling your story is the only accountability any well-intentioned person ever needs. And if you don’t think someone is well-intentioned (as in, you believe your incident was a purposeful act of sabotage) then a post-mortem won’t solve the problem.

On a more serious note, incidents are rarely related to job performance, and blameless post-mortems don’t mean that people shouldn’t get performance feedback on their work. These are two separate concepts. Feedback is important, and you should give it. Systems are fragile, and you should investigate their inevitable failures with curiosity, not blame, in mind.

As an engineering manager, you can mitigate worries about accountability by sharing that you will still give feedback, what you will give feedback on, and how you handle accountability to the job in general. If you try to use a post-mortem as a venue to give people feedback and ‘hold them accountable’ for their human errors, you are doing the software equivalent of the leadership at Chernobyl, Three Mile Island, and the airlines whose planes have fallen out of the sky. You’re assuming that punishment will deter mistakes. This doesn’t work – because most people are already trying hard not to make mistakes, and the instruction to ‘Be more careful’ doesn’t work in a complex system. Your systems will fail because that is what systems do, but if you punish those system failures, your people will hide them from you and they will begin to compound. You will end up with more failures, not fewer, by blaming human error. You will also end up with a climate of low psychological safety that causes deeper day-to-day problems than just your incident rate.

Conclusion

Incident post-mortems are a wonderful way to improve your systems. In fact, it’s only through well-handled failures that you will be able to build a resilient system and a resilient team. A culture of curiosity and blamelessness when things go wrong is what fosters the ‘generative’ culture of learning and improvement that characterizes high-performing engineering teams and the workplace satisfaction that similarly keeps people engaged. So, the next time something goes sideways, remember that human error is the starting point of your investigation. Use the steps in this article to change your system and, along the way, boost your culture of learning and improvement. After all, there is nothing as interesting as an unexpected systems failure.