Key takeaways:
- Your tolerance for operational debt could be someone else’s burnout. What feels like background noise to you is a wall of red to a newer engineer.
- Stop being the hero and start fixing the system.
- If your systems only work because a few people know where the landmines are, that’s not resilience, it’s a liability.
During your last on-call shift, an alert fired and you barely glanced at it before going back to more important issues. Why? Because you know that particular alert is noisy. If it fires a second or third time, then you will act.
Whether consciously or not, over time, you have built a tolerance for operational friction. Your familiarity with the systems has led you to tune out the noise and see clear signals. It has given you a “spidey-sense” for which recurring issue is at play. It may have even caused you to normalize an overly manual process because it has become muscle memory. Eventually, that friction stops feeling like a problem. It is just how things are.
Now switch places with your team’s newest member. What is background noise to you is a wall of red to them. They attempt to triage everything coming in and fail to keep up, often missing critical signals. That manual process takes them far longer to piece together, leaving higher priorities unattended. They are overwhelmed and losing confidence.
This is what it looks like to be functioning with operational debt. The gap between what you tolerate and what others can safely handle has gotten too wide. You have accumulated enough friction that normal work depends on a handful of operational heroes who know how to dance around the landmines. If you are the staff engineer, there is a good chance that hero is you.
Drawing a line between on-call and support
One of the most common forms of operational debt is a lack of boundaries between on-call and support.
When the same person is handling production alerts and support requests, everything starts to blur. It all comes through the same channels, to the same people, at the same priority.
It becomes very hard to distinguish what truly cannot wait from what would be fine as a ticket to handle within 24 to 48 hours. As we are wired to empathize with whoever is in front of us, we end up reacting to whoever shouts the loudest.
You fix their problem in the moment, but usually nothing changes to prevent it from happening again.
A simple, high-leverage move is to create a clear boundary. Even better if you can divide this into two separate rotations. On-call handles time-sensitive production risk. Support handles everything else on a defined service level. You can still be responsive without pretending everything is a priority one (P1), and you give the team a much clearer mental model for what truly counts as urgent.
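To make the idea concrete, here is a minimal sketch of that boundary expressed as an explicit triage rule. It is illustrative only: the fields and categories are hypothetical, and your own definition of “time-sensitive production risk” is what belongs in the rule.
```python
from dataclasses import dataclass
from enum import Enum


class Route(Enum):
    PAGE_ONCALL = "page on-call"       # time-sensitive production risk
    SUPPORT_TICKET = "support ticket"  # answered within a defined SLA (e.g. 24-48h)


@dataclass
class Request:
    customer_impacting: bool  # is production behavior degraded for users right now?
    data_at_risk: bool        # could we lose or corrupt data by waiting?


def triage(req: Request) -> Route:
    """Decide whether a request interrupts the on-call or becomes a ticket.

    Deliberately narrow: only active production risk pages a human.
    Everything else still gets answered, just on the support rotation's SLA.
    """
    if req.customer_impacting or req.data_at_risk:
        return Route.PAGE_ONCALL
    return Route.SUPPORT_TICKET


# An internal "can you look at this dashboard?" ask becomes a ticket, not a page.
print(triage(Request(customer_impacting=False, data_at_risk=False)))
```
The point is less the code than the forcing function: someone has to write down, in one place, what actually pages a human.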
The fear of declaring incidents
Another place operational debt hides is in the way a team treats incidents. You have probably seen some version of this. An engineer pushes a bad change. A customer-facing issue pops up. Instead of opening an incident, people quietly scramble in Slack. Someone reverts, someone patches data, a few direct messages go out to calm things down. Everything is back to normal in an hour.
Staff engineers in particular are good at “just fixing it.” You know which buttons to push and which people to ping. You can paper over the underlying problems with expertise and hustle. The system never has to get better, because you always save it in time.
No incident. No record. No real follow-up.
The reasons people find to justify this are myriad. “Incidents should only be declared if we need to involve another team.” “We fixed it quickly so declaring an incident is just overhead.” “It was just a bad code change, there are no lessons to be learned.” “It makes us look bad.”
All of these reactions are understandable, but they’re also wrong. If an issue affects customers, interrupts on-call, or requires a coordinated response, it is already an incident. Whether you declare it or not only changes whether you will learn from it.
Declaring incidents isn’t just about learning from what happened. It is also about learning how to run incidents and handle operational issues. Building that familiarity in lower-pressure scenarios pays off big time in high-stakes outages.
Signal-to-noise ratio
Alert fatigue is another area where domain expertise can mask real problems. If you have been around the system long enough, you already know which alerts to ignore. You skim the subject line, see the familiar pattern, and go back to what you were doing.
Newer engineers do not have that filter. They see a screen full of red and a notification stream that never stops. They do not know which alerts are safe to ignore, and they do not want to be the person who missed the one that mattered.
You have alerts that fire in a Slack channel. You have alerts in that same Slack channel that tag the on-call. You have alerts that come in through email. You have alerts that go to a more prominent channel. Does the level of risk for an alert match the level of attention you’ve attributed to it in the system? Has anyone checked recently? Did you ignore that on-call ping last time it tagged you for that alert? If you did, that alert’s routing probably no longer matches its risk.
If you are not careful, you end up in a situation where the visibility of an issue has nothing to do with the risk and everything to do with history and momentum. The level of interrupt no longer matches the level of risk, and you are sending the rest of your team on wild goose chases.
As a staff engineer, you’re well-positioned to do an alert review. Reduce as much noise as possible. Calibrate the signal to the level of risk it poses. The goal is for alerts to be meaningful enough that people take them seriously without burning out.
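If it helps to make the review concrete, here is a small sketch of the kind of audit it can produce, assuming a hypothetical inventory of alerts where each has an agreed risk level and a current notification channel; the names and the risk-to-channel mapping are illustrative.
```python
from dataclasses import dataclass


@dataclass
class Alert:
    name: str
    risk: str     # "high", "medium", or "low", agreed during the review
    channel: str  # where it currently fires

# The attention each risk level should earn. Mapping is illustrative.
EXPECTED_CHANNEL = {"high": "page", "medium": "slack-with-tag", "low": "slack"}


def review(alerts: list[Alert]) -> list[str]:
    """Flag alerts whose interrupt level no longer matches their risk."""
    findings = []
    for alert in alerts:
        expected = EXPECTED_CHANNEL[alert.risk]
        if alert.channel != expected:
            findings.append(
                f"{alert.name}: risk={alert.risk}, fires via {alert.channel}, "
                f"expected {expected}"
            )
    return findings


inventory = [
    Alert("disk-above-90-percent", risk="high", channel="email"),  # under-alerting
    Alert("retry-spike", risk="low", channel="page"),              # noise that pages people
]
for finding in review(inventory):
    print(finding)
```
Anything the review flags is either noise to demote or real risk that deserves a louder channel.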
Pay attention to specific kinds of pain
Not all operational problems are equal in how much they hurt people. Urgency matters. Autonomy matters just as much. High urgency with high autonomy is stressful, but tolerable. The site is in trouble, but the person on-call can take clear action. They can reboot a service, trigger a failover, or run a script. They might be tired, but they are not helpless.
High urgency with low autonomy is where real damage happens. In those situations, the person on-call is responsible for handling an urgent issue but cannot actually do much about it. Any scenario where the responsibility and visibility of the issue lies with your team, but the ability to remediate requires another team, is one of these situations.
A downstream service is down, but there is no fallback or built-in resilience to the failure. A key service is out of capacity, but you’re reliant on another team to provision more. Anything that requires permissions you do not have in order to remediate.
As a tech lead or staff engineer, these are the types of operational friction you want to address first: high urgency but low autonomy. Find places where you’re blocked from taking action by another team or by a long-running manual process. Start brainstorming investments to improve the outcomes.
Addressing these mismatches has an outsized impact on how people feel while carrying the pager.

The correlation between operations and burnout
When people talk about immature operations, they usually think about outages and risk. Do we go down? How often? How bad is it when we do? Those things matter, but they are not the full story.
Operational immaturity also shows up in how it feels to work on the team day to day. Some signals include:
- Teammates fear their on-call rotations or actively complain about them.
- The same one or two engineers are pulled into every incident or issue because things can’t be resolved without them.
- Newer engineers are on rotation for six months to a year and are still only able to handle a subset of operational issues.
- The same issues recur, even with known improvement opportunities, because no time is spent remediating the root cause.
Over time, that environment wears people down. Tenured engineers burn out from being constantly on the hook. Newer engineers burn out from feeling constantly out of their depth. Your engineers use up all of their energy being on-call and you lose out on their contributions everywhere else.
Getting out of reactive mode
It’s important to recognize when you’re operating in reactive mode. If people are constantly juggling alerts, interruptions, and urgent escalations, there is no time left to address root causes.
The only way out of operational debt is to prioritize it. That means making cuts elsewhere, if only temporarily. Let’s say the ideal operating mode is 60% features and 40% everything else. A team with operational debt needs to turn that on its head. If you want to be particularly aggressive, you can do a full quarter or six months focused exclusively on buying down the debt.
Your job as a staff engineer is to have the awareness to make that call and bring evidence to leadership to make the case. If you can’t make a large shift, then start with the smallest incremental improvements you can make. Build additional operational resiliency into estimates for new features. Raise the bar for anything you’re adding to your system, whether that’s more automation, runbooks, or well-tuned alerts. Those are now part of the Minimum Viable Product (MVP).
The key is to stop treating operational chaos as background noise and start treating it as a first-class citizen.
The role of the staff engineer
As a staff engineer or tech lead, you sit in a unique position in all of this. You are close enough to the systems to see where the friction is. You are senior enough that people will follow your lead in how they treat incidents, alerts, and on-call. You also have the most practice living with the pain, which makes it easier for you to ignore.
Do not be proud of your tolerance. If your internal standard is “I can live with this,” that is the wrong bar. Ask instead whether your team can handle this environment. What if you got a new team member? How comfortable would you feel if you were not standing over their shoulder?
Design for average, not for heroes. Systems and processes should be safe and workable for most competent engineers, not just the people who know every corner case by heart.
Make the hidden cost visible. Track interruptions, incidents, support loads, and pages, and bring that data into planning and prioritization.
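Here is a lightweight sketch of what that might look like, assuming you can export interrupt events (pages, support asks) from whatever tools you use; the data shape is hypothetical.
```python
from collections import Counter
from datetime import date

# Hypothetical export: one (engineer, date, kind) tuple per interrupt.
events = [
    ("aisha", date(2025, 3, 3), "page"),
    ("aisha", date(2025, 3, 3), "support"),
    ("aisha", date(2025, 3, 5), "page"),
    ("ben", date(2025, 3, 4), "page"),
]

# Count interrupts per engineer per ISO week, split by kind.
per_week = Counter(
    (engineer, day.isocalendar().week, kind) for engineer, day, kind in events
)

for (engineer, week, kind), count in sorted(per_week.items()):
    print(f"week {week}: {engineer} handled {count} {kind} interrupt(s)")
```
Even a rough count per person per week is usually enough to make the hidden cost legible in planning.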
Most of all, model the behavior you want to see. Declare incidents when they meet the bar. Differentiate on-call from support. Review and tune alerts. Advocate for ‘buying down’ the worst operational pain, even when it competes with feature work.
Your job is not to be the person who can withstand the most chaos. Your job is to help build a system where chaos is not a requirement for doing the work.