Incidents and postmortems can be stressful for everyone involved, but there is a way to resolve issues with empathy for all participants.
Developing software can sometimes feel like rowing a boat through a swirling maelstrom, trying to stay afloat through a constantly changing set of circumstances.
Issues are inevitable in this line of work; what matters is how we handle them. Great DevOps teams know the answer lies in studying how engineers respond to incidents. The lessons learned there are a lens into the wider organization and provide a path to building more stable systems and less stressed teams.
Viewing incidents as investments
Let’s look at an incident I recently helped investigate. A key piece of our architecture is our overnight extract, transform, and load (ETL) data system. The details of the incident are fairly simple: none of the ETL jobs ran overnight because someone had turned the system off. To resolve things, we needed to turn it back on.
The issue resolved, we could have gone back to our regular day-to-day work. But that would mean missing out on a valuable learning opportunity. Instead, we should view incidents as investments, not catastrophes.
And if that investment has already been made, how do we get the most value out of it? An incident means that our mental model of our system missed something. What can we learn from the incident? What are the best ways to fuel that learning?
Most companies already include investigation as part of their incident process. It’s common practice to have a postmortem meeting, where incidents are discussed. It’s rarer to have a dedicated investigation phase, where a site reliability engineer (SRE) digs through logs, reads chat messages, and interviews participants to learn as much as possible about the incident.
Not every incident needs a deep investigation and review though, and like any other aspect of engineering, it’s up to the team to determine where to invest their effort. Let’s start with postmortems.
A brief history of postmortem strategies
A long time ago (in SRE years) a blameful postmortem would have pinpointed the engineer responsible for turning off the ETL system and concluded that they aren’t to be trusted with on/off switches. This focus on human error not only misses the deeper questions, but also discourages future incident response, as teams are more focused on avoiding blame than solving problems.
Thankfully, we’ve largely moved past that point as an industry. Most companies have embraced Blameless Postmortems, where the focus is on what happened and not who did what. A blameless postmortem lets us analyze and learn from the technical side of our systems. In the case of the missing ETL jobs, we could learn why the jobs were turned off in the first place and how that mechanism works.
However, such an investigation often misses the deeper insights. Humans are constantly meddling with the system, deploying changes, flipping feature flags, or even changing the base infrastructure. How do we learn from that?
The cutting edge of incident investigation has now shifted towards Blame-Aware Postmortems, where we acknowledge that these discussions can tend toward blame and actively try to counter that instinct. The keystone of blame-aware thinking is to assume that everyone involved in the incident did the best they could with the information available to them at the time.
Putting things into action
And you may ask yourself: “How did I get here?” – Talking Heads, Once In A Lifetime
In practical terms, running a human-focused postmortem starts well before the actual meeting. By reading through the chat logs from the incident, you can identify key themes, spot where people got stuck, and flag any potentially sensitive issues ahead of the meeting.
Now you can start the meeting by outlining the timeline of the incident and some of the themes you found. This helps postmortem participants build empathy with those at the sharp end of the incident.
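Some of that preparation can be partly mechanized. What follows is a minimal sketch, not a description of any tooling we actually use: it assumes the chat log can be exported as tab-separated lines of timestamp, author, and text (a hypothetical format), and it treats a long silence in the channel as a hint, nothing more, that responders may have been stuck. The sample messages, the anonymized author names, and the ten-minute threshold are all made up for illustration.

```python
from datetime import datetime, timedelta
from typing import NamedTuple


class Message(NamedTuple):
    timestamp: datetime
    author: str
    text: str


def parse_log(lines):
    """Parse 'ISO-timestamp<TAB>author<TAB>text' lines (a hypothetical export format)."""
    messages = []
    for line in lines:
        stamp, author, text = line.rstrip("\n").split("\t", 2)
        messages.append(Message(datetime.fromisoformat(stamp), author, text))
    # Sort so the timeline reads in order even if the export does not.
    return sorted(messages, key=lambda m: m.timestamp)


def find_quiet_gaps(messages, threshold=timedelta(minutes=10)):
    """Return (start, end) pairs where the channel was silent for longer than threshold."""
    gaps = []
    for earlier, later in zip(messages, messages[1:]):
        if later.timestamp - earlier.timestamp > threshold:
            gaps.append((earlier.timestamp, later.timestamp))
    return gaps


# Invented sample data, loosely modeled on the ETL incident described earlier.
sample_log = [
    "2024-01-15T07:02:00\tresponder_a\tETL dashboards look empty this morning",
    "2024-01-15T07:05:00\tresponder_b\tLooking at the scheduler now",
    "2024-01-15T07:31:00\tresponder_b\tThe ETL system was switched off, turning it back on",
    "2024-01-15T07:33:00\tresponder_a\tJobs are running again",
]

messages = parse_log(sample_log)
gaps = find_quiet_gaps(messages)

for message in messages:
    print(f"{message.timestamp:%H:%M}  {message.text}")
for start, end in gaps:
    print(f"Quiet from {start:%H:%M} to {end:%H:%M}: worth asking what was happening here")
```

A silence in the channel does not prove anyone was stuck; people may have been on a call or heads-down in a terminal. The output is a list of moments to ask open-ended questions about, not a set of conclusions.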
With the stage set, use blame-aware thinking to prompt discussion. Open-ended questions help discussion flow and give an opportunity for different voices to be heard. One trick is to avoid zooming in on any particular individual and instead focus on the system itself.
A good starter question might be: “Why does our system rely on one individual to remember to do the right thing?” Let the discussion flow and don’t go too deep into any particular topic or solution. Make sure to thank people for contributing. Speaking up in these meetings can be hard.
One question I make sure to ask in every postmortem is, “What surprised you?” This is a great way to identify the knowledge gaps and assumptions among the team, and it gives people the floor to talk about what they’ve learned.
It’s also important to remember that your goal with this meeting is learning, and we can’t learn from hypotheticals. Don’t focus too much on action items, as these often end up being overfitted alerts or tickets addressing edge cases that fall into the backlog, never to be touched again.
All of this is a lot of work, but it’s worth it! Running human-focused postmortems centered around learning helps discover the edges of our systems as we scale. It’s also a helpful onboarding tool: employees quickly pick up practical lessons about how our systems operate and become less afraid to speak up when things go haywire.
Save yourself from the tedious postmortems of old, where you read a timeline and create the same old tickets. Now, by asking questions, you might just find some surprising answers.