Observability goes beyond logs, metrics, and traces, and it's up to you as a manager of engineers to set out the vision and policies to make that possible.
Observability is a term that has leapt from the obscurity of engineering textbooks into common usage across the technology industry. At its core, observability is an ideal state in which the inner workings of software systems can be monitored and maintained. This goal has become increasingly important as organizations struggle to understand the inner workings of ever more complex, distributed applications.
While certain tools can help developers and site reliability engineers (SREs) strive towards this goal, it's up to you as an engineering leader to lay down the policies and a vision to make observability possible.
What is observability?
Observability as a software principle grew out of control theory, a mathematical discipline developed to explain mechanical engineering processes. In control theory, the formal definition of observability is the ability to infer the internal states of a system based on its external outputs.
It’s difficult to pin down exactly when this concept made the leap into the software domain, but a 2013 blog post by the Twitter engineering team titled “Observability at Twitter” may be a good place to start. In the decade since, observability has become a topic of keen interest for engineering leaders aiming to build more resilient systems, as well as a useful marketing term for vendors of monitoring and logging software.
Monitoring vs observability
To fully understand observability, let’s compare it to simple monitoring. On the surface, they seem like they might be the same thing – surely monitoring a system is the same thing as observing it?
While monitoring is an important part of observability, achieving full observability goes beyond just throwing a bunch of network packet sniffers and log analyzers at your problem. These tools can help you with the known unknowns – that is, they can monitor the data that you point them at, like central processing unit (CPU) usage or network throughput. A truly observable system, by contrast, is architected to surface the unknown unknowns: problems you weren’t even aware of yet.
The three pillars of observability
Observability can be said to rest on a foundation of three important system outputs, sometimes called pillars:
- Metrics offer a top-line view into system performance. They’re good for telling you what’s happening to your system at a moment in time. Is network traffic overwhelming? Is latency acceptably low? Is your disk space filling up? Depending on your application or business, some of these metrics will be key performance indicators (KPIs) that tell you about the immediate health of your system and its ability to respond to user needs.
- Logs provide a timestamped, historical record of events in your system. They can help give you context for current problems so you can start diagnosing them. Say there was a mysterious spike in CPU usage this afternoon. Has anything like that happened before? Does it occur on a regular schedule? Does it always coincide with another seemingly unrelated event?
- Traces record the path a user or system request takes across multiple components within a system. These are an increasingly important source of information in distributed cloud-based architectures, where requests might be routed across different virtual machine (VM) instances, containers, and components in ways that aren’t immediately obvious. They can tell you why users are seeing poor performance or where bottlenecks are degrading application speed.
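To make the three pillars concrete, here is a minimal sketch of a single request emitting all three signals. It uses only the Python standard library. The in-memory stores (`metrics`, `logs`, `spans`) and the function names are hypothetical stand-ins for whatever backends and conventions your team actually uses.

```python
import json
import time
import uuid
from collections import Counter

# Hypothetical in-memory stores standing in for real metric, log, and trace backends.
metrics = Counter()
logs = []
spans = []


def record_span(trace_id, name, start, end):
    """Trace: one timed step of a request, tied to the request by trace_id."""
    spans.append({"trace_id": trace_id, "name": name, "duration_s": end - start})


def handle_request(path):
    trace_id = uuid.uuid4().hex  # shared ID that links every signal for this request
    start = time.perf_counter()

    # Metric: a cheap aggregate number, suited to dashboards and alerts.
    metrics["requests_total"] += 1

    # Log: a timestamped, structured record of a single event.
    logs.append(json.dumps({"ts": time.time(), "trace_id": trace_id,
                            "event": "request_received", "path": path}))

    # Trace: time each component the request passes through.
    db_start = time.perf_counter()
    time.sleep(0.001)  # stand-in for a database call
    record_span(trace_id, "db_query", db_start, time.perf_counter())

    record_span(trace_id, "handle_request", start, time.perf_counter())
    return trace_id
```

The shared `trace_id` is the design choice worth noticing: it lets someone jump from a suspicious metric to the logs and spans for the specific requests behind it.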
Together, these three data sources can drive you toward true observability. There is also a growing set of both proprietary and open-source platforms available to help you derive useful patterns from that information.
As an engineering leader, you are responsible for choosing the tools that best fit your organization’s needs. You’ll also need to decide which tools to instrument your systems with, as each comes at a cost.
For example, if your SREs use a dashboard with an overwhelming number of metrics on it, it’s easy to get lost in the details, so you need to know which are most important. Logs are a rich mine of data, but they can also quickly fill up disk space, so you have to decide how long to keep them and how detailed they can be. And tracing tools can slow application and network performance, so it’s important to understand to what extent that will happen and how sparingly they should be used.
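Two of these tradeoffs can be put into numbers. The sketch below is illustrative only: `should_sample` shows one common way to trace a fixed fraction of requests (hashing the trace ID so the decision is repeatable), and `log_storage_gb` is back-of-the-envelope arithmetic for a retention policy. Both function names and all the figures are assumptions for the example, not recommendations.

```python
import hashlib


def should_sample(trace_id: str, sample_rate: float) -> bool:
    """Keep roughly sample_rate of traces, decided deterministically per trace ID."""
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big")  # integer in [0, 2**64)
    return bucket < sample_rate * 2**64


def log_storage_gb(gb_per_day: float, retention_days: int) -> float:
    """Steady-state disk needed to keep logs for the given number of days."""
    return gb_per_day * retention_days


# Example: trace about 10% of requests, keep 30 days of logs at 5 GB per day.
traced = sum(should_sample(f"trace-{i}", 0.10) for i in range(10_000))
print(f"traced {traced} of 10000 requests")
print(f"log storage needed: {log_storage_gb(5, 30)} GB")
```

Because the sampling decision depends only on the trace ID, every service that sees the same request makes the same choice, so sampled traces stay complete.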
Why is observability important?
The key benefit to building your systems with observability in mind is the ability to diagnose underlying problems more effectively and understand the real-time fluctuations that can affect the performance of your digital products – bringing obvious advantages for uptime and customer satisfaction.
Observability isn’t just important to SREs or engineers responsible for the uptime of an application. Developers can use a better understanding of the underlying platform to build more efficient and high-quality software at scale.
This philosophy also helps break down walls between different IT silos, which is of particular interest to the many organizations implementing DevOps. It can help developers and operations specialists better understand the internal state of an application and how their code affects the underlying infrastructure after deployment. All of that should add up to faster troubleshooting and more efficient code development.
Observability can lead to less time spent in meetings as well. An observable platform comes closer to the dream of being “self-documenting”, which means that developers and operations staff can understand how things work at a glance, rather than needing to consult with the people who built it.
Observability challenges
Developing a truly observable platform isn't necessarily an easy task, however. A recent survey from LogDNA and the Harris Poll found that 74% of respondents are struggling in their observability quest. Some of the challenges those companies cited include:
- Finding tools that can support multiple use cases and allow different teams to collaborate.
- Ingesting the wide variety of data produced by different tools and managing them in standardized formats.
- Controlling the costs associated with data storage and management, as well as the various tools and platforms involved.
These are all challenges engineering leaders will have to overcome in their observability journey.
How can you make a system observable?
Implementing observability in practice requires two big steps. First, you need to instrument your application or infrastructure. This means putting software and network tools in place that measure all the things you need to know about your platform – collecting the metrics, logs, and traces we discussed above.
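As a rough illustration of what instrumenting code can look like, the sketch below wraps a function so that every call records its duration and logs its outcome. It is a minimal standard-library example, not a substitute for a real instrumentation library. The `instrumented` decorator, the `call_durations` store, and `lookup_order` are all hypothetical names.

```python
import functools
import logging
import time
from collections import defaultdict

logger = logging.getLogger("instrumentation")

# Hypothetical in-memory metric store; a real system would export these values.
call_durations = defaultdict(list)


def instrumented(func):
    """Record the duration and outcome of every call to func."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            result = func(*args, **kwargs)
            logger.info("call succeeded: %s", func.__name__)
            return result
        except Exception:
            logger.exception("call failed: %s", func.__name__)
            raise
        finally:
            # Runs on success and on failure, so slow errors are measured too.
            call_durations[func.__name__].append(time.perf_counter() - start)
    return wrapper


@instrumented
def lookup_order(order_id):  # hypothetical business function
    return {"order_id": order_id, "status": "shipped"}
```

Calling `lookup_order(42)` behaves exactly as before, but each call now leaves a timing in `call_durations["lookup_order"]` and a log line behind it.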
The next step is to find a platform that allows you to manage all that raw data. This can include alert management systems that let you know when metrics are out of safe territory. Dashboards will also help to coordinate data from disparate sources, allowing developers and SREs to visualize patterns of performance and get into those unknown unknowns. There is also a big opportunity in this space for machine learning to be applied to analyze all this data and provide you with insights and automated remediation actions.
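The alerting idea can be sketched in a few lines. This is a simplified, hypothetical rule format rather than any particular product's API: `AlertRule` and `should_alert` are made-up names, and the rule fires only when several consecutive samples breach the threshold, which is one common way to avoid paging people for a single noisy reading.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class AlertRule:
    metric: str
    threshold: float
    min_breaches: int  # consecutive samples above threshold; assumed to be >= 1


def should_alert(rule: AlertRule, samples: List[float]) -> bool:
    """Fire only when the most recent min_breaches samples all exceed the threshold."""
    recent = samples[-rule.min_breaches:]
    return len(recent) == rule.min_breaches and all(s > rule.threshold for s in recent)


# Hypothetical rule: alert when p95 latency stays above 500 ms for 3 samples in a row.
rule = AlertRule(metric="p95_latency_ms", threshold=500, min_breaches=3)
print(should_alert(rule, [120, 610, 720, 805]))  # last three samples all breach
print(should_alert(rule, [610, 720, 130, 805]))  # the breach streak was interrupted
```

Choosing the threshold and the number of consecutive breaches is a policy decision as much as a technical one, since it sets how often your team gets interrupted.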
You have many options when it comes to observability tools, from commercial products to open-source offerings such as the fast-emerging OpenTelemetry project. You can also piece together individual observability tools, or go for an integrated platform. You might want to start with Gartner's ratings of top APM and observability tools to assess the vendor landscape.
What comes next?
If your organization is building an application or platform from scratch, you need to ensure that observability is part of the conversation as you plan.
If you need to make your existing infrastructure more observable, start thinking about how you can instrument your current code, what skills your team needs, and what the performance tradeoffs might be.