In every engineering role I’ve had, from a startup with a team of five engineers to Netflix, I’ve seen the panicked dash of an engineer to their desk, breathless with their phone clutched in their hand, having just been paged for a production outage.
To say it feels 'hurried' is an understatement. Whatever type of business you’re in, the knowledge that users are impacted, and that the business could be losing money and trust, amplifies the ticking of the clock as you try to track down and resolve the issue.
This is the scene observability tools were made for. Without metrics, traces, and logs, you’re grasping at straws, digging through recent commits, or checking for property changes. With metrics, traces, and logs, you can start to piece together the real story of what happened. Observability into a system is a powerful multiplier for engineers. It can help bridge the gap between new and seasoned engineers, smooth the communication of bugs between teams, and generally hasten resolution. It’s powerful stuff.
What is observability?
Observability is the practice of instrumenting a system to generate data as it operates to provide insight into what it’s doing. A system is often made up of multiple services, and with observability, you can get insight into the individual services as well as how they fit together into a system. The three pillars of observability are metrics, traces, and logs.
- Metrics indicate how the system or subsets of the system – like services – are performing at a macro scale.
- Traces follow an individual request through the system, illustrating the ecosystem as a whole. Traces can provide timing details (how much time did a request spend in a given service?), a call graph depicting a request’s path, and associations through tags (build numbers, ASGs, and so on); a quick sketch of attaching tags to a span follows this list.
- Logs give a rich, detailed view into an individual service. Logs give the service a chance to speak its piece and document what went wrong as it tried to execute a given task.
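For a quick illustration of what those tags can look like in practice, here’s a minimal sketch using the OpenTelemetry Java API. The attribute names and values (build.number, deployment.asg, user.id) are hypothetical examples rather than an established convention, and your tracing client may expose tagging differently.

```java
import io.opentelemetry.api.trace.Span;

public class CheckoutHandler {

    public void handleCheckout(String userId) {
        // Tag the active span so the resulting trace can later be searched or
        // grouped by these values. The attribute names here are illustrative only.
        Span span = Span.current();
        span.setAttribute("build.number", "2024.07.1");
        span.setAttribute("deployment.asg", "checkout-v042");
        span.setAttribute("user.id", userId);

        // ... handle the rest of the request ...
    }
}
```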
How to get the most out of observability tooling
As with all things, observability has a cost. Implementation and storage are two that come to mind immediately, and one of the first questions I’d have is, ‘How can I get the best return on my investment?’ In my experience building observability tools at Netflix, I’ve found a few things that can make a huge difference in how usable your observability data is, and how much leverage you get from it.
One relatively simple thing you can do to maximize your ROI is to tie your traces and logs together by writing the trace ID into your log. Exactly how you do that depends on your implementation details (here’s a handy guide for Spring Boot), but the benefit is well worth the effort. By tying together your logs and traces, you combine the ecosystem overview of a trace with the inner service details of a log. For a failing request, the combination of traces and logs can point you to exactly what went wrong and where. A trace says you hit services X, Y, and Z. A log from Z for a particular trace ID can tell you exactly what went wrong; for example, that it failed to fetch information for user ID 123 because the caller had inadequate permissions. This is a powerful combination.
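As a rough sketch of what that can look like (assuming SLF4J/Logback and an OpenTelemetry-style trace context; the 'traceId' MDC key and the logging pattern are choices you’d make, not a standard), you can copy the trace ID into the logging MDC so every log line carries it:

```java
import io.opentelemetry.api.trace.Span;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.slf4j.MDC;

public class TraceAwareWorker {

    private static final Logger log = LoggerFactory.getLogger(TraceAwareWorker.class);

    public void process(String userId) {
        // Copy the current trace ID into the MDC so the logging pattern
        // (e.g. "%X{traceId}" in logback.xml) stamps it onto every line.
        String traceId = Span.current().getSpanContext().getTraceId();
        MDC.put("traceId", traceId);
        try {
            log.info("Fetching profile for user {}", userId);
            // ... service logic ...
        } finally {
            MDC.remove("traceId");
        }
    }
}
```

With the trace ID in every log line, your log pipeline can index on it and a trace viewer can link straight to the matching log lines; many tracing clients can automate this MDC correlation for you, which is even better.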
Combining traces and logs can offer much more detailed insight into requests, but the traces and logs have to exist, which can get tricky. Many tracing implementations focus on sampling a small, fixed percentage of traffic. A fixed sampling rate of, say, 5% makes it much less likely that you’ll find data on a particular failing request, which in turn makes developers less likely to rely on distributed tracing to help solve their issues. If developers can’t rely on it, they won’t use it, and if they don’t use it, why spend money or time on it at all? One place to start is to capture 100% of a reasonable subset of business-critical traffic. Perhaps that’s all calls to your '/checkout' endpoint. Over time, a more sophisticated sampling approach could help you ensure you record useful data that is representative of your traffic, or help you identify anomalous traces. But if you can start somewhere small with high-impact data, that’s a great way to make trace data an essential tool for the users who troubleshoot those services.
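As a concrete starting point, here’s a sketch of that idea against the OpenTelemetry Java SDK. The '/checkout' match and the 5% fallback rate are assumptions for illustration, and the attribute key that carries the request path depends on which semantic conventions your instrumentation emits.

```java
import io.opentelemetry.api.common.AttributeKey;
import io.opentelemetry.api.common.Attributes;
import io.opentelemetry.api.trace.SpanKind;
import io.opentelemetry.context.Context;
import io.opentelemetry.sdk.trace.data.LinkData;
import io.opentelemetry.sdk.trace.samplers.Sampler;
import io.opentelemetry.sdk.trace.samplers.SamplingDecision;
import io.opentelemetry.sdk.trace.samplers.SamplingResult;

import java.util.List;

/** Keeps every trace for business-critical endpoints, samples the rest at a fixed rate. */
public class CriticalEndpointSampler implements Sampler {

    // The attribute key depends on your instrumentation's semantic conventions
    // ("http.target" in older conventions, "url.path" in newer ones).
    private static final AttributeKey<String> HTTP_TARGET = AttributeKey.stringKey("http.target");

    private final Sampler fallback = Sampler.traceIdRatioBased(0.05);

    @Override
    public SamplingResult shouldSample(Context parentContext, String traceId, String name,
                                       SpanKind spanKind, Attributes attributes,
                                       List<LinkData> parentLinks) {
        String target = attributes.get(HTTP_TARGET);
        if (target != null && target.startsWith("/checkout")) {
            // Business-critical traffic: always record and export the trace.
            return SamplingResult.create(SamplingDecision.RECORD_AND_SAMPLE);
        }
        // Everything else: defer to the fixed-rate fallback sampler.
        return fallback.shouldSample(parentContext, traceId, name, spanKind, attributes, parentLinks);
    }

    @Override
    public String getDescription() {
        return "CriticalEndpointSampler{checkout=100%, default=5%}";
    }
}
```

You’d register a sampler like this when building your tracer provider, for example via SdkTracerProvider.builder().setSampler(new CriticalEndpointSampler()).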
The users who troubleshoot services are often developers, but not always. At many organizations, customer service operations find themselves in the role of troubleshooting user issues, which can tie directly back to a trace. At Netflix, we’ve built a tracing UI that correlates traces with their relevant logs and packages them together into a tool that customer support can access. Even without our custom UI, this same technique can help customer support answer questions and escalate issues appropriately. With the traces and detailed logs, customer support can see where an issue occurred, and get insight into what happened.
Getting it right
When you use observability to get a better understanding of what’s going on in your systems, you can shorten the time to resolution during outages, but that’s far from the end of the benefits. At Netflix, we’ve seen distributed tracing become a key tool throughout the software development lifecycle. During testing, traces and logs can give you insight into the latency of a new endpoint, or help you identify miscommunication between services. Distributed tracing eases the burden of communicating issues to partner teams. Sending someone a link to a trace that illustrates an issue is an easy way to show clearly what happened, with minimal hypothesizing and without a cascade of supplemental information. If you have a log of the payload passed between services, it can make reproducing the issue extremely straightforward. If you’re able to make your distributed tracing tool available to higher tiers of customer service operations, that can take even more weight off your engineering teams. Distributed tracing can even help new engineers onboard more efficiently by visualizing your call graph and the flow of a request between services.
There’s a laundry list of benefits, but the key to making observability tooling a success is to make it valuable and make it easy. Strive to make your trace data as accessible as possible. You can do this by returning trace IDs in your API payloads by default, or by tagging your traces with information about the caller or calling application so users can search for traces by those tags. Expand your observability tools into development and testing environments. Like any tool, the more often it’s used, the more comfortable users are with it. Commit to a tracing client that works with the types of services you have, and build abstractions that minimize the work required of developers. We’ve found particularly strong value in the abstractions we’ve built around automating log correlation by trace ID. It’s not one-size-fits-all, but maintaining a focus on developer experience is essential to ensuring the tools you invest in actually get used.
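To make the first of those ideas concrete, here’s a sketch of returning the trace ID on every response. It assumes Spring Boot 3 (Jakarta Servlet), the OpenTelemetry API, and instrumentation that has already made the server span current by the time the filter runs; the 'X-Trace-Id' header name is an arbitrary choice.

```java
import io.opentelemetry.api.trace.Span;
import io.opentelemetry.api.trace.SpanContext;
import jakarta.servlet.FilterChain;
import jakarta.servlet.ServletException;
import jakarta.servlet.http.HttpServletRequest;
import jakarta.servlet.http.HttpServletResponse;
import org.springframework.stereotype.Component;
import org.springframework.web.filter.OncePerRequestFilter;

import java.io.IOException;

@Component
public class TraceIdResponseFilter extends OncePerRequestFilter {

    @Override
    protected void doFilterInternal(HttpServletRequest request,
                                    HttpServletResponse response,
                                    FilterChain filterChain) throws ServletException, IOException {
        // Echo the current trace ID back to the caller so a failing response
        // can be handed to engineers (or support) with the exact trace attached.
        SpanContext spanContext = Span.current().getSpanContext();
        if (spanContext.isValid()) {
            response.setHeader("X-Trace-Id", spanContext.getTraceId());
        }
        filterChain.doFilter(request, response);
    }
}
```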
Telling better stories
When you tie together metrics, traces, and logs, you tell a detailed story about what is happening in your services. The metrics set the scene: is a particular service experiencing a high error rate? The trace gives you the timeline and all the participating services, as well as some clues through tags (like status, build number, region, etc.) that tell you more about each of them. Having this broader context allows you to home in on the logs, which give you the gritty details. The combination of all three gives engineers the ability to see into a complex system and understand its inner workings.
The initial draw of observability is often understanding your systems when it’s urgent and important, like when you’re running back to your desk after being paged for a production outage. The beauty of observability tools is that there is a ton of value to be had outside the resolution of a production issue. The stories you tell with observability give your engineers a speed boost in productivity, from initial testing through to operating a service.