Monitoring debt leads to alert fatigue and increased operational risk. Is it time you performed an alert audit and got your monitoring in order?
It will likely be a scenario familiar to any engineer who has been asked to take a pager and go on-call: the signal-to-noise ratio feels completely off, leaving a frustratingly large number of alerts to deal with.
This is what we refer to as “monitoring debt”. It’s the hole you dig by changing your technical system without changing how you monitor it. The longer this drift continues, the less relevant the alert thresholds become. The engineers on-call grow numb to responding to false positives and can start to treat pages like the boy who cried wolf.
Unfortunately, the solution is not as simple as telling your engineers to just “set better alerts” either.
“Coming up with threshold values is not a trivial task,” Slawek Ligus wrote in Effective Monitoring and Alerting. “The process is often counterintuitive, and it’s simply not feasible to carry out an in-depth analysis for a threshold calculation on every monitored time series.”
This issue is not unique to distributed software systems either. In fact, the bulk of research being done around this topic is in the medical field. Alarm fatigue was cited as a major patient safety concern in a 2019 paper noting health care staff must contend with an average of 700 physiologic monitor alarms per patient each day in an environment where 80-99% of alerts are false or nuisance alarms.
Ineffective pages are an operational risk and the deeper in monitoring debt you get, the more likely your on-call engineers are to experience alert fatigue.
By conducting an alert audit, followed by continuously tuning alerts as a part of the software development workflow, you can start to set your organization on the path away from alert fatigue, and even to alert joy.
The telltale signs of monitoring debt
There are a few common symptoms of monitoring debt:
- There are one or two people on your team that refine alerts, or it’s exclusively the job of site reliability engineers (SREs).
- Updating or checking monitor accuracy is not a part of your deployment process.
- Alert configuration is not cleanly ordered or hierarchically namespaced.
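To make that last symptom concrete, one common remedy is a hierarchical naming convention for alerts, e.g. `team.service.signal.condition`. The convention and names below are hypothetical, not from any particular vendor; a minimal sketch of enforcing it:

```python
# Hypothetical convention: every alert name is team.service.signal.condition.
# Namespaced names make ownership and intent obvious at a glance.

def parse_alert_name(name: str) -> dict:
    """Split a namespaced alert name into its components, or fail loudly."""
    parts = name.split(".")
    if len(parts) != 4:
        raise ValueError(f"expected team.service.signal.condition, got {name!r}")
    team, service, signal, condition = parts
    return {"team": team, "service": service, "signal": signal, "condition": condition}

print(parse_alert_name("payments.checkout.latency.p99_high"))
```

A check like this can run in CI against your alert configuration, turning “cleanly ordered” from an aspiration into an enforced invariant.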
Breaking down alert fatigue
Just how risky is it to “set and forget” your monitoring policies?
A high number of false alerts over time trains workers to assume most alerts will be false. Even worse, there is a compounding effect each time the same false alert is triggered. Data for medical clinicians shows the likelihood of acknowledging an alert dropped 30% with each reminder!
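To see how quickly that compounds, here is a back-of-the-envelope sketch, assuming the 30% drop applies multiplicatively with each repeat of the same alert:

```python
# Assumption: each repeat of the same false alert cuts the chance of
# acknowledgement by 30% (the figure cited above), applied multiplicatively.
ack_probability = 1.0
for reminder in range(1, 6):
    ack_probability *= 0.7  # 30% drop per reminder
    print(f"after reminder {reminder}: {ack_probability:.0%}")
```

By the fifth repeat, under this assumption, fewer than one in five alerts gets acknowledged at all.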
This is alert fatigue in action – when a high number of alerts numbs responders and leads to missed or ignored alerts, or delayed responses. The more an engineer is exposed to false alerts, the more they will tolerate, normalize, and eventually ignore them.
The effects of alert fatigue don’t disappear after an on-call shift ends either. Analysts at IDC found that 62% of IT professionals attribute high turnover rates to alert fatigue. Put simply, by not tackling alert fatigue, organizations risk losing tenured engineers to burnout.
The signs to watch out for regarding alert fatigue are:
- High number of alerts.
- High number of false alerts.
- High number of unactionable alerts.
- Ever-present stress about what important signals you might be missing.
- Muting alerts as a first line of defense.
- Spending too much time investigating false or unactionable alarms.
Alert fatigue is serious business, affecting engineers on and off the pager. Now let’s turn to mitigating the harmful effects, starting with an alert audit.
Running an effective alert audit
Start the conversation
Trying to audit your organization’s entire set of alerts is a daunting task. One approach to scoping the area of concern is to start with a particular engineering team or on-call rotation. After the inaugural audit you can focus on developing a repeatable template for other teams to follow.
A key step in building trust with operators is listening to their on-call experience and letting them know that actions will be taken to make things better. Don’t promise the world, but do let folks know that the end goal is to maintain sustainable on-call operations and that this is a shared responsibility with leadership.
If you have a healthy team dynamic, these conversations can be done synchronously over video or audio calls. Otherwise, consider asynchronous options like a survey or asking in a 1:1 before going forward with a group discussion.
Set the baseline
The baseline is a measure of the status quo. At first, the data can seem overwhelming and impossible to manage. Push through those feelings and be honest with your team and wider organization about your findings – the good, the bad, and the ugly.
Gather the facts
Look back across a defined time window – the last week, month, quarter, or whatever makes sense for your system – and pull the data on pages, warnings, tickets generated, and any other basic signals from your on-call rotations. Now ask:
- How often was each team member on-call?
- How many pages per shift?
- How many warnings per shift?
- What is the ratio of alerts from pre-production versus production environments?
- How many out-of-business hours interruptions occurred?
Roll this up in a pivot table and share widely with your team. This is what will frame the feelings you gather next.
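A minimal sketch of that roll-up, assuming you can export pager events as simple records (the field names and values here are illustrative, not tied to any vendor's export format):

```python
# Roll up exported pager events into the baseline numbers discussed above.
from collections import Counter

# Illustrative export: one record per alert event.
events = [
    {"engineer": "ana", "severity": "page",    "hour": 3},
    {"engineer": "ana", "severity": "warning", "hour": 14},
    {"engineer": "ben", "severity": "page",    "hour": 22},
    {"engineer": "ben", "severity": "page",    "hour": 10},
]

# Pages per on-call engineer.
pages_per_engineer = Counter(
    e["engineer"] for e in events if e["severity"] == "page"
)

# Out-of-business-hours interruptions (assuming a 9:00-17:00 workday).
after_hours = sum(
    1 for e in events
    if e["severity"] == "page" and not 9 <= e["hour"] < 17
)

print(pages_per_engineer)
print(after_hours)
```

Even a toy script like this surfaces the shape of the problem: who is absorbing the pages, and how many land outside business hours.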
Gather the feelings
Now survey your on-call engineers. Specifically, check in with engineers earlier in their careers, since they won’t have normalized the noise yet, and longer-tenured engineers who know the hot spots but have become jaded. Walk through the baseline data with them, listen to how they interpret it, and ask whether they agree with your findings.
Ask probing questions
- What percentage of time are engineers working on the sprint while on-call?
- Are alerts named confusingly?
- How does the team manage their monitoring configurations today?
- Is there a team or vendor that is providing “out of the box” alerts?
- How many alerts were received as a result of doing planned work?
A simple framework for pulling this together is:
Feeling: "None of the last five pages I got were actionable."
Fact: “The primary rotation paged five times over the last week.”
Finding: “Team X is getting paged frequently for non-actionable reasons.”
Now you have to evaluate your alerts. All of them. Seriously.
- Take one alert and look at its history: have the times it has fired been actionable for responders or not?
- Walk through investigating the last time it fired.
- Is the warning threshold reasonable?
- Is the alert threshold reasonable?
- Is there a runbook or are links to monitoring and docs easily accessible?
- Decide what action to take:
- Do nothing.
- Tune the threshold.
- Demote to a warning.
- Demote to a daytime ticket.
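That decision list can be encoded as a first-pass heuristic based on how often an alert was actionable when it fired. The thresholds below are illustrative assumptions, not a standard; pick your own with the team:

```python
# Hypothetical heuristic mapping an alert's history to a tuning action.
# The rate cut-offs (0.8, 0.5, 0.2) are placeholders to be agreed with the team.

def suggest_action(times_fired: int, times_actionable: int) -> str:
    if times_fired == 0:
        return "do nothing"            # no history to judge yet
    rate = times_actionable / times_fired
    if rate >= 0.8:
        return "do nothing"            # mostly actionable: leave it alone
    if rate >= 0.5:
        return "tune threshold"        # useful but noisy: tighten it
    if rate >= 0.2:
        return "demote to warning"     # rarely actionable: stop paging
    return "demote to daytime ticket"  # almost never actionable

print(suggest_action(times_fired=10, times_actionable=3))
```

A heuristic like this shouldn't replace the walkthrough of each alert, but it gives the team a shared starting point to argue with.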
Anything that can page a human is fair game. Keep in mind that the goal is to help the human operators at the end of the day. The art of monitoring and alerting isn’t widely taught; a quick way to bring everyone up to speed is to pair as a group on evaluating the first alert.
Ask the team: “What business impact should rise to the level of alerting a human?”
Once every team member has a good sense of the process and expectations, there are several options for completing the evaluations. Depending on the number of engineers in the on-call rotation or the sheer number of alerts, the remaining evaluations can be divided up among the engineers, folded into the primary on-call engineer’s weekly tasks, or remain an all-team paired activity.
Wait until a full rotation has cycled through your team before looking back and holding an on-call retrospective. Review the actions taken to tune alerts and discuss how the experience has or has not changed the feeling of holding the pager.
If progress has been insubstantial or lopsided, figure out where to shift investments and continue iterating!
Celebrate, share, repeat
After the first team has figured out a workflow for auditing alerts, take a beat to recognize the improvements and investments you’ve made! Operations work tends to be unglamorous, but the benefits of tackling alert fatigue are certainly cause for celebration.
Remember to share your findings widely and openly with other teams both informally and formally.
Repeat across teams and fold alert tuning into the software development lifecycle to maintain your results.
Why focus on alerts?
There is a deluge of data that can be alerted on and a culture of “set it and forget it” monitoring which often buries operators in low-quality signals. The 2022 Cloud Native Complexity Report from Chronosphere found 59% of respondents reported that half of their incident alerts are not actually helpful or usable. That needs to change.
If you find yourself agreeing with that statistic it might be time to tackle your monitoring debt with an alert audit. Let us know how it goes!