Anxiety is the enemy of learning. What does this mean for how engineers learn observability fundamentals?
For the first several years of my software career, I was painfully inefficient at troubleshooting production issues. Despite having access to software monitoring and observability platforms, I only used them when something was broken in production.
The effect was that all of my learning around observability concepts happened while I was actively troubleshooting production issues – work that typically happens under the dual pressures of a ticking clock and repeated requests for status updates. Sometimes, it happened live in a war room, in front of colleagues, exposing my lack of knowledge for all to see.
If you’re anything like me, this situation is extremely anxiety-provoking. I often didn’t know where to look, what data and signals to track, or even what questions to ask. The dashboards and data felt overwhelming, and I didn’t feel safe admitting knowledge gaps. So, incident after incident, I kept muddling through.
This was nearly five years ago, when a lot of these monitoring and observability platforms offered a fraction of the products, dashboards, and views they do today.
If they were intimidating to navigate then, they’re even more so now.
Learning on fire is the worst way to learn
The problem with my approach wasn’t the platforms themselves; it was that I was trying to learn them at the worst possible time.
A belief I’ve encountered often in the software world is that developers learn best by “getting their hands dirty.”
While that might be the case, science also tells us something else: anxiety has the potential to impede learning.
This has a very practical consequence: trying to learn a platform mid-incident is like trying to learn to read a map after you’re already lost. You need that skill before anxiety impairs your judgment and your ability to orient yourself.
Create calm, safe, communal spaces for skill building
I experienced a big “aha” moment around incident response anxiety when I became a technical upskilling specialist at Pluralsight. In that role, I managed skill development for our internal engineering teams. One of my initial tasks was to run a needs analysis across our two engineering organizations to determine their top skill development priorities.
The number one knowledge and skill gap that surfaced for both orgs?
Software monitoring and observability.
I ended up building two separate learning experiences that took into account the specific needs of each organization, including the different observability platforms each was using:
- A two-day observability workshop for the organization using Datadog.
- A four-week observability academy for the organization using New Relic.
While I was in the weeds designing the curriculum for these programs, I kept thinking: “If only I’d known these concepts and tools back when I was still in the code. I could have been so much more effective in troubleshooting. And so much less anxious!”
The learning environments I created for my learners were calm, communal, low-stakes, and exploratory – a deliberate contrast to the urgency of live troubleshooting in production systems. That difference in learning conditions had a significant impact on the engineers. People asked questions without fearing judgment. They explored metrics and dashboards they’d never visited before. They discovered ways to monitor and observe software systems that they didn’t know were possible, let alone already available to them in their organization’s observability platform of choice.
The best part, though? They vulnerably shared stories about past troubleshooting mishaps and mistakes, and the less experienced got to see that not even the folks with decades of software experience knew everything there was to know about observing software systems.
No, you can’t prepare for everything. But that’s not the point.
Creating calm, communal spaces for training is important, but real production issues are rarely that tidy. Systems are highly complex, and users notoriously find novel ways to break things. No amount of training will prepare you for every surprise; surprise is simply the norm when you’re building complex software.
But, with the proper pre-crisis training, you can get yourself 80% of the way to observability (and observability-platform) proficiency. With focused learning time, you can explore the myriad metrics, signals, and data that observability tools surface. You can develop an intuition for which of them to examine in a given situation. And perhaps most importantly, you can figure out where to find those signals, so that hunting for them isn’t added to your cognitive load in the middle of a stressful production issue.
In short, you won’t be learning the conceptual domain and the platform simultaneously during a stressful outage.
This proactive approach to building resilience isn’t unprecedented in software. Some teams practice chaos engineering, which entails deliberately injecting failures into a system to test how it behaves under stress. It’s a powerful way to surface hidden weaknesses before they cause an outage.
I think focused learning around observability concepts and tooling is the human equivalent. When engineers proactively explore their observability tools, identify knowledge gaps, and build skills in a focused way, they’re more prepared for “game day.” They’re more resilient – and less panicked – when the real thing happens.
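To make the chaos engineering side of that analogy concrete, here is a minimal, hypothetical sketch of application-level fault injection in Python. The names (`chaos_wrap`, `failure_rate`, `max_delay_s`) are illustrative only; real chaos engineering tools such as Netflix’s Chaos Monkey typically inject failures at the infrastructure level rather than wrapping individual function calls.

```python
import random
import time


def chaos_wrap(fn, failure_rate=0.1, max_delay_s=2.0):
    """Return a version of fn that randomly fails or slows down.

    Illustrative sketch only: real chaos tooling usually works at the
    infrastructure level (terminating instances, degrading networks).
    """
    def wrapped(*args, **kwargs):
        if random.random() < failure_rate:
            # Injected failure: does the caller retry, alert, or degrade gracefully?
            raise RuntimeError("chaos: injected failure")
        # Injected latency: do your dashboards and alerts surface the slowdown?
        time.sleep(random.uniform(0, max_delay_s))
        return fn(*args, **kwargs)
    return wrapped


# Example (hypothetical): wrap a dependency call during a game-day exercise,
# then watch how the team's observability tooling reflects the injected faults.
fetch_profile = chaos_wrap(lambda user_id: {"id": user_id}, failure_rate=0.2)
```

The point of an exercise like this isn’t the code; it’s that engineers get to watch their observability platform react to failure in a calm, low-stakes setting, before the real thing happens.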
“Shouldn’t developers just learn this on their own?”
After the two-day observability workshop I mentioned, someone left anonymous feedback that stuck with me:
“Software developers should know how to direct their own learning. We shouldn’t need to be told when and what to learn.”
I understand the sentiment, but the reality is more complex. In learning science, that ability is called self-directed learning ability (SDLA), and it encompasses the following sub-skills:
- Identifying one’s own learning needs
- Setting appropriate, achievable learning goals
- Finding relevant, quality learning resources
- Applying effective learning strategies
- Evaluating progress toward learning goals
In a perfect world, every developer – every person – would have a strong SDLA.
But in the real world, there are countless barriers – some individual, some societal, some cultural, some institutional – to developing the skills that make up SDLA. And ironically, in my experience, those most in need of structured support are often the least likely to advocate for it themselves. This is where guided training is useful. With everyone on board, it becomes far easier to identify skill and knowledge gaps and work together to close them.

Invest in confidence before the crisis
While the systems themselves will benefit as skills improve, this isn’t just about increasing uptime or reducing mean time to recovery. It’s about people and their professional development. It’s about building confidence and self-efficacy. It’s about bolstering resilience in both systems and humans.
Let’s stop pretending that the best time to learn about monitoring and observability is when something’s on fire. Learning these concepts and tools is what we should be doing before the alarms go off.