It’s time to get observability on your leadership’s radar.
In the rest of this series, you have heard about the technical benefits of observability. But technical merits are not the only consideration for spending the scarce time of our engineering teams and our finite budgets. How can you make the business case for adopting observability practices on your teams and building observability into your software? How do you choose from the many available options for improving observability? And how can you measure the impact once you've invested in observability?
Identifying a gap in observability
Often, organizations make knee-jerk changes once there's a smoking gun in the form of a major outage, and a singular root cause to that outage. It's significantly easier to build the business case for a new toolset or method when there's an ‘if only we'd had backups’ line in an often-blameful postmortem. Executives read the postmortem, often sack the employee who deleted the important file, engage consultants to introduce a backup strategy, then breathe a sigh of relief thinking that the gap is now closed. This short-sighted approach requires both a catastrophic failure and a simple fix.
A lack of observability is much more pernicious. It silently gums up all aspects of your software engineering and operations teams. Repeated outages impair the ability of employees to do their jobs, and degrade customer trust in your business. There is no singular root cause to point to, only a slew of unexplained system behavior and near-misses. Alert fatigue leads to burnout, and burned-out engineers leave, taking their valuable expertise with them. And a lack of understanding of complex software systems leads to system fragility and slows the delivery of features that the business demands. Thus, the signal you should be looking for on your teams is excessive fatigue and rework.
Are you always on the back foot because customers discover critical bugs in production before you do? Do you have incidents and bugs piling up faster than they can be retrospected and follow-up items triaged? Are your features delayed by weeks or months because engineers can't figure out how their microservices are entangled with the monolith and the rest of the microservices?
Although there may be contributing factors for each individual area and tooling that specifically addresses it, it is important to also address a systemic lack of observability. Teams operating in the dark without observability will see occasional minor issues escalate into repeated catastrophes. Once your teams can observe the inner workings of your systems in production, you as a leader will be able to see further down the road. With increased predictability, you will have fewer unpleasant surprises for your executive leadership.
Creating an observability practice
Observability's goal is to give engineering teams the visibility into their production systems required to develop, operate, and report on the system. Similar to security and testability, observability isn’t just a checkbox requiring one-time effort. Instead, it is an attribute of the sociotechnical system comprising your teams and software. A sophisticated analytics platform is useless if the team using it feels overwhelmed by the interface or is discouraged from querying due to fears of running up a large bill.
A well-functioning observability practice should empower engineers to ask questions, allow them to detect and resolve issues in production, and begin to answer business intelligence questions in real time. After all, if nobody is using the new feature that the engineering team has built, or if one customer is at risk of churning because they are persistently experiencing issues, is the business truly healthy?
As the DevOps movement gains mainstream traction, forward-thinking engineering leadership teams remove barriers between engineering and operations teams. With the practice of software ownership, each development team builds and operates its own services. Without observability, software engineers lacking on-call experience tend to struggle with understanding where failures are or how to mitigate them. Observability gives software engineers the appropriate tools to debug their systems instead of needing to rely upon excessive manual work, playbooks, or guesswork.
When introducing an observability practice, engineering leaders should first ensure that they are fostering a supportive, blameless culture that rewards curiosity and collaboration. They should ensure that there is a clear scope of the work to introduce observability, such as on one team, or in one line of business. And they should identify what infrastructure and platform work is required to support this effort. Only then should they begin the technical work of instrumentation and analysis.
Acquiring the appropriate tools
Although observability is primarily a cultural practice, it does require engineering teams to possess the technical capability to instrument their code, to store the telemetry data that is emitted, and to analyze that data in response to their questions. Thus, your initial technical effort will require the set-up of tooling and instrumentation. While there is the temptation to roll one's own observability solutions, building an observability platform from scratch that actually supports the capabilities you need is prohibitively difficult and expensive. Instead, there is a wide range of solutions, whether commercial, open source, or hosted open source, to consider.
For instrumentation of both frameworks and application code, OpenTelemetry is the emerging standard. It supports every open source metric and trace analytics platform, and it’s supported by almost every vendor in the space. There is no reason to roll one's own instrumentation framework or lock into a vendor's instrumentation. Thanks to OpenTelemetry's pluggable exporters, you can configure your instrumentation to send your data to multiple analytics tools. Resist the urge to think of needing all three of ‘metrics, logging, and tracing’ as observability; instead think about what data type or types are best suited to your use case, and which can be generated on-demand from the others.
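To make the pluggable-exporter idea concrete, here is a minimal, library-free sketch in Python. It is not the OpenTelemetry API; the names Event, InMemoryExporter, and Pipeline are hypothetical and exist only to illustrate the shape of the idea: application code emits telemetry once, and configuration decides which backends receive it.

```python
from dataclasses import dataclass, field
from typing import Protocol


@dataclass
class Event:
    """One unit of telemetry: a name plus arbitrary key/value attributes."""
    name: str
    attributes: dict = field(default_factory=dict)


class Exporter(Protocol):
    """Anything with an export() method can be plugged into the pipeline."""
    def export(self, event: Event) -> None: ...


class InMemoryExporter:
    """Hypothetical stand-in for a backend; it just remembers what it was sent."""
    def __init__(self, label: str) -> None:
        self.label = label
        self.received: list[Event] = []

    def export(self, event: Event) -> None:
        self.received.append(event)


class Pipeline:
    """Instrumentation calls emit() once; every registered exporter receives the event."""
    def __init__(self, exporters: list[Exporter]) -> None:
        self.exporters = exporters

    def emit(self, name: str, **attributes) -> None:
        event = Event(name, attributes)
        for exporter in self.exporters:
            exporter.export(event)


# Two destinations (assumed labels), one line of instrumentation.
vendor_a = InMemoryExporter("vendor-a")
self_hosted = InMemoryExporter("self-hosted")
pipeline = Pipeline([vendor_a, self_hosted])
pipeline.emit("checkout", customer_id="c-42", duration_ms=183)
```

Because the application only ever calls emit(), swapping or adding a backend during an evaluation changes the list passed to Pipeline, not the instrumented code. That separation is what keeps instrumentation from locking you into one vendor.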
Data storage and analytics often come as a bundle, depending upon whether you use open source or proprietary solutions. Vendors of proprietary all-in-one solutions include Honeycomb, Lightstep, New Relic, Splunk, Datadog, and more. Open source frontends include Grafana, Prometheus, and Jaeger, but they all require a separate datastore to scale. Popular open source data storage layers include Cassandra, Elastic, M3, and InfluxDB. It’s great to have so many options, but be wary of the operational load of running your own data storage cluster. Many end users have found that their ELK clusters gobble systems engineering time and grow quickly in cost. That pain is why there's a competitive market for managed open source telemetry data storage. I would also caution against buying a separate product for each ‘pillar’, or attempting to bolt observability onto an existing monitoring system. Since observability arises from your engineers interacting with your data, it is better to have one solution that works seamlessly than to maintain three or four disjointed, poorly usable systems.
As always, ensure that you are investing your engineers’ time on differentiators for your core business. Observability isn't about empire-building and creating larger teams; it's about saving businesses time and money. That isn't to say ‘don’t create an observability team’; a good observability team will focus on helping product teams achieve platform or partner integration rather than trying to reinvent the wheel with a custom backend. Evaluate which platform best fits the needs of your pilot teams, then make it accessible to your engineering teams as a whole.
Measuring virtuous cycles
If the symptom of teams flying blind without observability is excessive rework, then teams with sufficient observability have predictable delivery and sufficient reliability. What does this look like, both in terms of practices as well as key results?
Once observability practices have taken root in a team, little outside intervention will be required to maintain excellent observability. Just as a team wouldn't check in code without tests, checking that there is instrumentation during code review becomes second nature. Instead of merging code and shutting their laptops at the end of the day, teams observe the behavior of their code as it reaches each stage of deployment. ‘Somebody else's problem’ becomes excitement over seeing real users benefitting from the features they are delivering. Observability isn't just for engineers; when you empower product managers and customer success representatives to answer their own questions about production, it results in fewer one-off requests for data, and less product management guesswork.
As teams reap the benefits of observability, they will feel more comfortable understanding and operating in production. The proportion of unresolved ‘mystery’ incidents should decrease, and time to detect and resolve incidents will decrease across the organization. However, do not over-index on shallow metrics such as raw number of incidents. It is a good thing for teams to feel more comfortable reporting incidents and digging into near-misses as they gain increased visibility into production. You’ll know you’ve reached an equilibrium when your engineers live in their modern observability tooling and no longer feel that they're wasting time chasing dead ends in disjointed legacy tooling.
Whenever your teams come across new questions they can’t answer, they will find it easier to take the time to fill in those gaps rather than guessing. Is there a mystery span that is taking too long? They will add subspans for smaller units of work within it, or add attributes to understand what is triggering the slow behavior. While observability always requires some care and feeding as integrations change or new code surfaces are added, a solid choice of observability platform can minimize operational burden and minimize the total cost of ownership.
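Here is a rough illustration of what ‘adding subspans and attributes’ means in practice. It is a toy tracer built only on the Python standard library, not a real tracing SDK; span, finished_spans, and render_report are hypothetical names invented for this sketch. A real team would most likely use their instrumentation library's tracer, but the shape of the change is the same.

```python
import time
from contextlib import contextmanager

finished_spans = []  # stand-in for a real telemetry backend
_stack = []          # spans currently open, innermost last


@contextmanager
def span(name, **attributes):
    """Time a unit of work and record it with its parent and attributes."""
    record = {
        "name": name,
        "parent": _stack[-1]["name"] if _stack else None,
        "attributes": dict(attributes),
    }
    _stack.append(record)
    start = time.perf_counter()
    try:
        yield record
    finally:
        record["duration_s"] = time.perf_counter() - start
        _stack.pop()
        finished_spans.append(record)


def render_report(customer_id, rows):
    # The outer span alone only says "this was slow". The attribute and the
    # two subspans below are the additions that say where and for whom.
    with span("render_report", customer_id=customer_id) as outer:
        outer["attributes"]["row_count"] = len(rows)
        with span("load_rows"):
            loaded = list(rows)
        with span("format_rows", row_count=len(loaded)):
            return [f"{r:>8}" for r in loaded]


render_report("c-42", range(1000))
for s in finished_spans:
    print(s["name"], s["parent"], s["attributes"], f'{s["duration_s"]:.6f}s')
```

With the subspans in place, the question ‘why is render_report slow?’ becomes ‘is it load_rows or format_rows, and for which customers and row counts?’, which can be answered from the data instead of by guessing.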
Your teams may even wind up creating Service Level Objectives (SLOs) to track their progress towards reliability, or turning off potential-cause-based alerts that are duplicative of alerts based on symptoms of user pain. SLOs allow executives and engineering teams to communicate about reliability as a product requirement, and can give direct value to executives by helping them understand service reliability at a glance.
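For readers who have not worked with SLOs, the arithmetic behind that ‘at a glance’ view is simple. The sketch below assumes a request-based availability SLO; the function name error_budget_report and the traffic numbers are made up for illustration, and real SLO tooling typically adds time windows and burn-rate alerting on top.

```python
def error_budget_report(slo_target, total_requests, failed_requests):
    """Summarize how much of an SLO's error budget a service has consumed.

    slo_target is the fraction of requests that must succeed, e.g. 0.999.
    """
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    allowed_failures = (1 - slo_target) * total_requests
    availability = 1 - failed_requests / total_requests
    budget_used = failed_requests / allowed_failures
    return {
        "availability": availability,
        "allowed_failures": allowed_failures,
        "budget_used": budget_used,
        "slo_met": availability >= slo_target,
    }


# Illustrative numbers: a 99.9% target over two million requests.
report = error_budget_report(
    slo_target=0.999, total_requests=2_000_000, failed_requests=1_500
)
print(f"availability {report['availability']:.4%}, "
      f"budget used {report['budget_used']:.0%}")
```

In this example a 99.9% target over two million requests allows about 2,000 failures, so 1,500 failures means roughly three quarters of the budget is spent. That single number gives executives and engineers a shared basis for deciding whether to prioritize new features or reliability work.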
Now that you’ve learned the business case for observability, you can give your software engineering and operations teams superpowers to ship quickly and with confidence. You’ve learned the risks of delaying: a lack of observability will hobble the growth of your business, risk user trust with outages, and burn out your engineering team. So what’s the first step? It’s time to get observability on your leadership’s radar. Schedule time to make the case to your executives, even informally, by the end of the quarter. Start investing in observability now; by this time next year, you’ll wonder how you ever lived without it.