Observability isn't only helpful when dealing with outages.
Engineering teams are often led to believe that observability is something to concern yourself with only when dealing with service outages. But while it is indeed crucial to be able to ask questions of your systems during an incident, and rapidly get answers, this is not the only way to reap the benefits of observability. Personally, I have found immense value in using observability in many other scenarios. Here I’m going to outline a few of them, and share some tips for expanding your use cases, so that your team can start making the most of observability.
When to use observability tools
Understanding what’s normal in your systems
Rather than only digging through observability tools when something is wrong, I have found it really valuable to pull up tools like Jaeger or Zipkin to understand how our systems actually function. I call this exploring, as opposed to debugging. I do this because if we don’t know what normal looks like, then it’s much tougher to spot subtle problems. Understanding what’s normal also helps us to set appropriate thresholds in our systems, beyond which we need human intervention.
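To make the idea of a threshold grounded in "normal" a little more concrete, here is a minimal sketch in Python. It assumes you have exported a list of request latencies in milliseconds from a quiet period; the function name, the choice of the 99th percentile and the 1.5x headroom are all illustrative assumptions rather than recommendations.

```python
import statistics


def baseline_threshold(latencies_ms, percentile=99, headroom=1.5):
    """Derive an alert threshold from latencies observed during normal operation.

    `percentile` and `headroom` are illustrative defaults, not recommendations.
    """
    if not 1 <= percentile <= 99:
        raise ValueError("percentile must be between 1 and 99")
    if len(latencies_ms) < 2:
        raise ValueError("need at least two samples to estimate a baseline")
    # With n=100, quantiles() returns the 99 cut points for percentiles 1..99.
    cut_points = statistics.quantiles(latencies_ms, n=100, method="inclusive")
    return cut_points[percentile - 1] * headroom


if __name__ == "__main__":
    # Hypothetical latencies sampled while the system was behaving normally.
    normal_week = [120, 135, 128, 140, 131, 126, 138, 450, 129, 133]
    print(f"alert above {baseline_threshold(normal_week):.0f} ms")
```

The point is less the arithmetic than the habit: the threshold comes from what the system has actually been observed doing, so it can be recomputed as "normal" shifts.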
Routinely exploring our systems to understand what normal looks like is also useful in more modest scenarios, such as knowing what interdependencies exist between parts of a distributed system. I find that observability tools typically give you the most accurate version of a service topology. So rather than digging through documentation that might be outdated, it may be better to head straight for your observability tooling. The caveat, as with anything that involves instrumentation, is that the information will only be as accurate as your instrumentation is complete.
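Tracing tools usually draw this topology for you, but the underlying idea is simple enough to sketch. The Python below assumes spans have been exported as plain dictionaries with hypothetical `span_id`, `parent_id` and `service` keys; real tools use their own schemas, so treat the field names as placeholders.

```python
from collections import defaultdict


def service_edges(spans):
    """Count caller -> callee edges between services from a flat list of spans.

    Each span is a dict with hypothetical keys: span_id, parent_id, service.
    """
    by_id = {span["span_id"]: span for span in spans}
    edges = defaultdict(int)
    for span in spans:
        parent = by_id.get(span.get("parent_id"))
        if parent is None:
            continue  # root span, or the parent was never instrumented
        if parent["service"] != span["service"]:
            edges[(parent["service"], span["service"])] += 1
    return dict(edges)


if __name__ == "__main__":
    trace = [
        {"span_id": "a", "parent_id": None, "service": "checkout"},
        {"span_id": "b", "parent_id": "a", "service": "payments"},
        {"span_id": "c", "parent_id": "a", "service": "inventory"},
        {"span_id": "d", "parent_id": "c", "service": "inventory"},
    ]
    for (caller, callee), calls in service_edges(trace).items():
        print(f"{caller} -> {callee} ({calls} call(s))")
```

Note how a span whose parent is missing is silently skipped: that is exactly the instrumentation gap that makes a derived topology incomplete.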
Understanding the impact of changes during development
Observability tooling can be really helpful during the development stage to understand the impact of changes being made to systems. For example, by including metadata about deployments and service versions amongst the telemetry collected by your tooling, it is often possible to interrogate changes in the behavior of your system based on a particular deployment event, service version, commit hash, or tag. As a result, instead of relying solely on specific tests in dedicated environments (which rarely simulate production accurately), you can use your observability tools to give you a more accurate picture of how your systems are actually changing in response to delivered code changes.
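As a small illustration of what that metadata buys you, the following Python sketch groups request durations by service version. It assumes each telemetry event is a dictionary carrying hypothetical `service_version` and `duration_ms` fields; in practice your observability tool would run the equivalent query for you.

```python
from collections import defaultdict
from statistics import median


def median_latency_by_version(events):
    """Group request durations by the service version that handled them."""
    grouped = defaultdict(list)
    for event in events:
        grouped[event["service_version"]].append(event["duration_ms"])
    return {version: median(durations) for version, durations in grouped.items()}


if __name__ == "__main__":
    # Hypothetical events; the version strings and durations are made up.
    events = [
        {"service_version": "1.4.0", "duration_ms": 310},
        {"service_version": "1.4.0", "duration_ms": 290},
        {"service_version": "1.5.0", "duration_ms": 180},
        {"service_version": "1.5.0", "duration_ms": 220},
    ]
    print(median_latency_by_version(events))
```

The same grouping works for a commit hash or a deployment ID, provided that attribute is attached to the telemetry when it is emitted.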
This really helped my team when we were attempting to improve the response time of an API that was dependent on a somewhat popular ERP system. When we made our planned changes (switching out the type of client that was calling the ERP system) and ran performance tests in isolation, both response times and service throughput appeared to improve dramatically. However, when we deployed the changes, there was only a minimal improvement. It was only later, when we examined traces in Jaeger and metrics in Prometheus, that we realized we had not accounted for a peculiar behavior: the ERP system introduced a performance penalty as high as 5000 milliseconds for each first unique request within what appeared to be a ten-minute time frame, after which each subsequent similar request was returned in under a second. The pattern would then repeat after a short period of inactivity.
This behavior would have been impossible to catch with our performance tests as they were. But using our observability tooling, it was immediately clear what was happening. When utilized in this way, observability ultimately becomes a more accurate source of feedback to cross-check expectations from development as well as to complement testing efforts.
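To show why a test that replays similar requests can miss this kind of penalty, here is a toy model in Python. It is not how the ERP system works internally, which we never saw; it simply assumes a request is slow unless the same request was seen within the previous ten minutes. The 200 ms "warm" figure is an invented stand-in for "sub-second".

```python
WINDOW_SECONDS = 600  # the apparent ten-minute time frame from the anecdote
COLD_MS = 5000        # upper bound of the penalty we observed
WARM_MS = 200         # assumed sub-second latency, illustrative only


def simulate(requests):
    """Return a latency per request under the toy warm/cold model.

    `requests` is an iterable of (timestamp_seconds, request_key) in time order.
    """
    last_seen = {}
    latencies = []
    for timestamp, key in requests:
        previous = last_seen.get(key)
        is_warm = previous is not None and timestamp - previous < WINDOW_SECONDS
        latencies.append(WARM_MS if is_warm else COLD_MS)
        last_seen[key] = timestamp
    return latencies


if __name__ == "__main__":
    # A performance test that hammers one request repeatedly looks fast...
    test_run = simulate((second, "order-42") for second in range(100))
    # ...while traffic made of distinct requests pays the penalty every time.
    production_like = simulate((second, f"order-{second}") for second in range(100))
    print(sum(test_run) / len(test_run), sum(production_like) / len(production_like))
```

Under this model the shape of the traffic matters more than the client change, which matches what the traces eventually told us.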
Getting insights into the developer experience
When I was working with a previous client, there was a huge push to understand how teams were tracking towards the four key metrics (deployment frequency, lead time for changes, change failure rate, and time to restore). While we could not fully agree on how to measure all of the above effectively, we found that there was one related question which almost always came up, and that this question could be accurately answered with our observability tooling. That is, how long were the builds routinely taking, and what, if any, were the bottlenecks?
At the time, our build tool didn’t provide the insights to help us answer that question. To solve this, we started building out some custom tooling around our CI/CD pipeline. However, we quickly pivoted to simply adding instrumentation to the pipeline so that we could use the same observability tooling that we were already familiar with to get the answers we sought. This immediately offered a great deal of insight, and allowed us to better understand our build process and intentionally plan for optimizations and improvements.
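The instrumentation itself does not need to be elaborate. As a rough sketch of the idea (not the tooling we actually built), the Python below times each pipeline step with a context manager and records the result as a span-like dictionary. The `PipelineRecorder` class and the step names are hypothetical; in a real setup the records would be shipped to your observability backend rather than kept in a list.

```python
import time
from contextlib import contextmanager


class PipelineRecorder:
    """Collect a duration and status for each pipeline step."""

    def __init__(self, clock=time.perf_counter):
        self._clock = clock  # injectable so the recorder can be tested
        self.steps = []

    @contextmanager
    def step(self, name):
        start = self._clock()
        status = "ok"
        try:
            yield
        except Exception:
            status = "error"
            raise  # let the pipeline fail as it normally would
        finally:
            self.steps.append(
                {"step": name, "duration_s": self._clock() - start, "status": status}
            )

    def slowest(self):
        """Return the recorded step with the longest duration."""
        return max(self.steps, key=lambda step: step["duration_s"])


if __name__ == "__main__":
    recorder = PipelineRecorder()
    with recorder.step("compile"):
        time.sleep(0.01)  # stand-in for real work
    with recorder.step("unit-tests"):
        time.sleep(0.03)
    print("bottleneck:", recorder.slowest()["step"])
```

Once step durations are emitted like any other telemetry, questions such as "which step got slower this month?" become ordinary queries.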
Interestingly, since then I have seen a few more features released from other providers of observability tools that seek to provide more insight into pipeline durations and bottlenecks in the CI/CD process. I imagine there is only going to be more interest in using observability tools to better understand – and improve – portions of the developer experience.
Tips for expanding your usage of observability in your team
You might be thinking that these scenarios sound overly specific, and that it’s not clear what you can do differently to get more out of your own observability process. How can you introduce the concept of everyday observability usage to your team – and make sure that it’s still relevant beyond service outages? Here are a few approaches that worked for me.
Encourage your teams to explore the systems they are building in the absence of service outages. Ask the team to practice looking for interesting patterns, or to even simply share information they learned about how their service interacts with services built by other teams.
On many teams I’ve been on, we have often had a person assigned to incident duty who, among other things, might oversee incoming support requests and/or be a first responder to outages. When there are no outages, this same person could be encouraged to explore the services the team is building and report back any interesting findings.
Similarly, consider onboarding all your team members (not just SREs) to your observability tooling. Encourage everyone on the team to use the available tools to explore how their systems are changing as they introduce new modifications. And encourage them to share these stories.
Make observability part of your definition of done
When writing up new user stories, make observability part of your definition of done. There is rarely a good reason to leave it as an afterthought. For example, any new instrumentation that needs to be added for a new feature or service should be included as acceptance criteria in the associated stories. This ensures that your team is thinking intentionally about how to make your system observable, and that right from the start, they are able to bake in any required changes. It also ensures that your system remains observable as it continues to grow and change.
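One way to turn such acceptance criteria into something checkable is a small automated test over the telemetry a feature emits. The Python sketch below assumes events are captured as dictionaries during a test run. `service.name` and `service.version` follow common naming conventions, while `order.type` and the helper itself are hypothetical examples.

```python
# Attributes a story's acceptance criteria might require (illustrative only).
REQUIRED_ATTRIBUTES = frozenset({"service.name", "service.version", "order.type"})


def missing_attributes(event, required=REQUIRED_ATTRIBUTES):
    """Return the required attribute names that the event does not carry."""
    return sorted(set(required) - set(event))


if __name__ == "__main__":
    # Hypothetical event captured while exercising a new feature in a test.
    emitted = {"service.name": "orders", "service.version": "1.5.0"}
    gaps = missing_attributes(emitted)
    print("missing:", gaps if gaps else "nothing")
```

A check like this could run alongside the story's other tests, so a missing attribute fails the build rather than surfacing during an incident.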
I have been on teams in the past where I have encouraged everyone to use the observability tooling to showcase the impact of the changes they made as part of their demos or desk checks. Not only does this ensure that teams are gaining familiarity with the tooling, but it also helps everyone become comfortable with talking about the impact of their changes in a way that is backed by actual data.
Instrument beyond the traditional
Yes, you can use observability to understand your running applications, but have you thought about applying that same instrumentation to your infrastructure? How about your CI/CD tooling? Or some other component of your system such as a particular control plane? The core tenets of observability can be extended to many atypical situations if we just resolve to expose as many signals and as much data as we reasonably can, and then think about how we can query that data to better understand what’s going on.
Even when instrumenting typical applications, consider adding domain-specific metadata beyond traditional attributes such as service name, endpoint, status code, cluster, or compute region. For example, if you’re instrumenting a service that is part of an order fulfillment flow, are you able to include data about the type of order, the ID of the fulfillment center, and perhaps even the order amount? This is just an example, but getting specific with your instrumentation allows you to not only interrogate your systems about outages but also explore behavior and patterns that are specific to your domain.
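To illustrate the kind of question this unlocks, here is a short Python sketch that computes a failure rate per value of any attribute. The event shape and the `fulfillment_center_id` and `order_type` field names are assumptions made for the example; your own attribute names will differ.

```python
from collections import Counter


def failure_rate_by(events, attribute):
    """Compute the share of failed requests for each value of `attribute`.

    A request counts as failed when its status code is 500 or above.
    """
    totals = Counter()
    failures = Counter()
    for event in events:
        value = event.get(attribute, "unknown")
        totals[value] += 1
        if event["status_code"] >= 500:
            failures[value] += 1
    return {value: failures[value] / totals[value] for value in totals}


if __name__ == "__main__":
    # Hypothetical order fulfillment events with domain-specific attributes.
    events = [
        {"status_code": 200, "fulfillment_center_id": "fc-1", "order_type": "express"},
        {"status_code": 503, "fulfillment_center_id": "fc-1", "order_type": "express"},
        {"status_code": 200, "fulfillment_center_id": "fc-2", "order_type": "standard"},
    ]
    print(failure_rate_by(events, "fulfillment_center_id"))
    print(failure_rate_by(events, "order_type"))
```

Without the domain attributes, the same data could only say that some requests failed; with them, it can say where in the business the failures cluster.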
While observability is often immensely valuable when debugging and attempting to recover from service outages, there are plenty of other ways that we can use it, not only to improve the systems we’re building but also to improve the experience of our teams. When we think of observability in this way, it’s clear that it shouldn’t just be the concern of our SREs, but should be a priority for anyone in the team with a vested interest in continuous learning and improvement, as it relates to the systems they are building.