How Netflix, Teachers Pay Teachers, Honeycomb, and more used observability in 2021

Engineering leaders reflect on their observability wins over the past year
December 22, 2021

We asked four engineering leaders how they’ve been using observability over the past year and what they’re looking forward to in 2022.

Has observability changed the way your engineering teams work?

Lesley Cordero, Software Engineer, Teachers Pay Teachers: Teams tend to think of observability’s importance mostly in the context of emergency situations, like outages or incidents, which is a very reactive posture. And while that is incredibly important, it is just as important to use observability in a proactive fashion. For the teams I’ve been on, getting to a place where we can proactively identify issues before we’re in a high-stress situation has been one of the biggest wins of being observability-driven: we’re able to use this tooling, knowledge, and culture in a way that sets us up for long-term success.

Parveen Khan, Senior QA Consultant, ThoughtWorks: The team now thinks of adding instrumentation as part of the feature development, rather than as an afterthought or nice-to-have. It’s not just about adding some random logging to tick the box; it’s more of a Dev and QA (tester/Quality Analyst) pairing effort, working together to make sure the data is clear enough to understand, that it creates enough visibility, and that it gives enough information to understand what’s happening with each request.

Additionally, when debugging issues, if the team notices a lack of visibility and cannot find a root cause that could easily have been surfaced with more specific instrumentation, they take that as feedback for improvement rather than playing the blame game. The improvement could be anything related to process or instrumentation.
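
For readers who want a concrete picture of the kind of deliberate, request-level instrumentation Parveen describes, here is a minimal sketch using the OpenTelemetry Python API. The handler, attribute names, and data-access call are hypothetical illustrations, not the team’s actual code:

```python
# Illustrative only: a hypothetical order-lookup handler instrumented so each
# request carries enough context to debug on its own.
from opentelemetry import trace

tracer = trace.get_tracer("orders-service")

def fetch_order_from_db(order_id: str):
    # Stand-in for a real data-access call; returns None if the order is missing.
    return {"id": order_id, "status": "shipped"}

def get_order(order_id: str, user_id: str):
    with tracer.start_as_current_span("get_order") as span:
        # Attributes agreed on by dev and QA together: specific enough to answer
        # "what happened to this request?" without guessing.
        span.set_attribute("app.order_id", order_id)
        span.set_attribute("app.user_id", user_id)
        try:
            order = fetch_order_from_db(order_id)
            span.set_attribute("app.order_found", order is not None)
            return order
        except Exception as exc:
            span.record_exception(exc)
            raise
```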

Kristie Howard, Site Reliability Engineering Manager, Netflix: Fundamentally, yes! Good observability practices allow your team to infer a system’s internal state from the data it outputs, which decreases the time it takes to triage issues. Great observability practices provide context that can lead to an enriched level of actionability, meaning you are spending more time fixing ‘the thing’ than searching for ‘the thing.’ This creates the conditions to shift from being a very reactive team to one that is more proactive and spends less time firefighting. When this happens, the team can then focus more on innovation and activities that add value to the business; oftentimes it also increases job satisfaction for the engineers.

Liz Fong-Jones, Principal Developer Advocate, Honeycomb: Practicing observability-driven development and continuous delivery has enabled us to iterate faster and get faster feedback on our product. We can practice production ownership without it being painful for the teams on call. And engineers feel happier because they can get immediate validation that their features are working!

To achieve this, we’ve added observability to our CI/CD pipelines to speed up test times, modified our GitHub pull request templates to include ‘how will you observe this in production? Link a graph.’, and established Service Level Objectives (SLOs) for all services.

Adopting these practices isn’t just good for morale; it enables us to ship a dozen times per day, including on Fridays, and to be nimble in competing with other companies.
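
To make ‘observability in the CI/CD pipeline’ concrete, here is a small, hypothetical sketch using the OpenTelemetry Python SDK; it is not Honeycomb’s actual tooling, just the general shape of emitting a span per pipeline step so you can see where build time goes:

```python
# Illustrative sketch: wrap each CI step in a span so pipeline stages show up as traces.
import subprocess
import time

from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import SimpleSpanProcessor, ConsoleSpanExporter

provider = TracerProvider()
# Console exporter for the sketch; a real pipeline would export via OTLP instead.
provider.add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)
tracer = trace.get_tracer("ci-pipeline")

def run_step(name: str, cmd: list[str]) -> int:
    with tracer.start_as_current_span(name) as span:
        start = time.monotonic()
        result = subprocess.run(cmd)
        span.set_attribute("ci.command", " ".join(cmd))
        span.set_attribute("ci.exit_code", result.returncode)
        span.set_attribute("ci.duration_s", time.monotonic() - start)
        return result.returncode

if __name__ == "__main__":
    run_step("unit-tests", ["pytest", "-q"])  # placeholder for your test command
```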

How do you get buy-in for observability from other teams?

LC: Having buy-in across an engineering organization is especially important in a microservices architecture because gaps or inconsistencies can easily emerge and make it incredibly painful to observe systems fully and successfully. One of the easiest ways I have found to get this type of buy-in is to make building observable systems easy. Microservices architectures often require that building a new service be easy and meet a consistent definition of production readiness; observability should be held to the same standard. Wherever I can, I try to make observability as out-of-the-box as possible.

PK: Getting buy-in, in my opinion, depends on various different factors, and context is a major part of it (context could be the size of the team or teams, what tools already exist, how easy or difficult it is to get approval for trying a tool, and so on). I find that the following things can help:

  • Sharing and showing the value of observability, why it is crucial, and how it can help the team
  • Building trust and understanding the pain points of the team 
  • Making it a collaborative exercise and showing the team they are contributing and adding value
  • Making the team feel empowered to contribute

With my team, I tried to find someone who was interested in trying new things and solving problems. Having conversations and sharing my thoughts on observability was key. I delivered an internal talk to introduce observability and spread the word, which worked really well because it sparked people’s interest in finding out more. I also gave a demo of what I had tried, using trial versions of tools available on the market. You could call it a mini proof of concept, showing some parts of what observability looks like.

KH: Culture is a key aspect to gaining buy-in for observability, especially from non-DevOps or SRE teams. To achieve this cultural shift in any organization, observability must be well articulated in strategy documents and roadmaps, and correlated to increased ROI. I would urge leaders to try to avoid centralizing the responsibility of observability with one team or part of your ecosystem. Developers, Operations, Product, and SRE teams alike should all be focused on improving observability as a North Star goal. Most importantly, educate your teams on observability and draw clear distinctions from terms such as monitoring, telemetry, or visibility. While such terms all work in concert to enable observability, they all mean different things.

LF-J: While we don’t need buy-in for observability from our engineers who have chosen to work at Honeycomb (an observability vendor), we often need to explain to potential customers how observability differs from tools they’re used to and what they will gain from adopting observability.

The eye-opening moment for many teams is seeing data flowing in for the first time from automatic instrumentation in their systems and being able to analyze and debug in real-time along any dimension. And once teams see the connection between attributes added in their code and trace spans in their real systems, they understand how continuously adding instrumentation benefits them, just as tests and comments do.

The Accelerate State of DevOps Report 2021 shows that elite performers are 4.1 times more likely to use observability tooling to make production decisions and improve system health. Observability isn’t just a fancy new way of saying ‘monitoring’ or ‘debugging.’ What sets observability apart is the ability to slice and dice your data, any way you see fit, to unlock new understandings about how your applications truly behave when experienced by end-users.

How do you interpret the data from observable systems?

LC: We interpret observability data through events in the form of spans and traces, logs, and metrics. Being able to move between these different representations of an event is an important part of my team’s strategy because the entire context is necessary for asking questions until a root cause is identified.
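
One common way to make that pivot between representations possible is to stamp every log line with the active trace and span IDs. Below is a minimal Python sketch using OpenTelemetry and the standard logging module; it is illustrative only, not LC’s team’s actual setup:

```python
# Illustrative sketch: add trace and span IDs to log records so you can jump
# from a log line to the trace it belongs to (and back).
import logging

from opentelemetry import trace

class TraceContextFilter(logging.Filter):
    def filter(self, record: logging.LogRecord) -> bool:
        ctx = trace.get_current_span().get_span_context()
        record.trace_id = format(ctx.trace_id, "032x")
        record.span_id = format(ctx.span_id, "016x")
        return True

handler = logging.StreamHandler()
handler.setFormatter(logging.Formatter("%(levelname)s trace=%(trace_id)s span=%(span_id)s %(message)s"))
handler.addFilter(TraceContextFilter())
logging.getLogger().addHandler(handler)
logging.getLogger().setLevel(logging.INFO)

logging.info("order lookup failed")  # now carries trace=... span=... for correlation
```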

PK: Having all the data centralized in one place makes it easier to query and interpret, and to understand what happened with each request. It also helps in creating dashboards to keep an eye out for any spikes. Having the data handy was really powerful for my team; depending on what questions we wanted to answer, we could answer them by looking at logs, traces, metrics, or dashboards derived from the same data. Our teams used dashboards to look for trends around certain failures in our services. We also used dashboards for API response times and shared those trends each month with business stakeholders to show that response times fell within our SLA.

KH: Interpreting data from observable systems requires rich cross-functional collaboration between developers, data, infrastructure, and operations engineers alike. Not only do they need to share their data, they need to layer on the context that only they would have about the metrics, logs, and traces that are being analyzed to inform decisions. They are all subject matter experts in their own right with different perspectives based on their roles. Bring them all together to create balance in your observability strategy and you can achieve great things for the business.

LF-J: Interpretation of observability data fundamentally involves an OODA loop: observe, orient, decide, act. The faster a team can understand the context of the system, identify the anomaly, and dive into it, the faster it can ask the next question or resolve the issue. That’s why we use SLOs extensively for orienting teams to the business context and BubbleUp to help our own teams understand what’s different about the failing requests. Once a team has formulated a hypothesis, our query builder allows them to test that hypothesis, with the granularity to examine individual failing requests, as a trace waterfall illuminates the details of even the darkest corners of our systems.

Rather than needing to learn a completely bespoke query language, engineers only need to click and drag to highlight an anomaly they’d like to investigate, or write a small snippet of SQL to specify the dimensions they’d like to home in on. And the query history functionality allows our engineers to start from jumping-off points from previous incidents or investigations.
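
The SLO-based orientation LF-J describes ultimately rests on simple arithmetic: compare the fraction of good events against the target and see how much error budget has been burned. A hedged sketch, with made-up numbers and a hypothetical helper rather than Honeycomb’s implementation:

```python
# Illustrative arithmetic only: how an SLO orients a team. Numbers are invented.
def slo_report(good_events: int, total_events: int, target: float) -> None:
    compliance = good_events / total_events
    error_budget = 1.0 - target                      # allowed failure fraction
    budget_used = (1.0 - compliance) / error_budget  # fraction of the budget burned
    print(f"compliance={compliance:.4%} target={target:.2%} budget_used={budget_used:.1%}")

# e.g. 9,985,000 successful requests out of 10,000,000 against a 99.9% SLO:
slo_report(9_985_000, 10_000_000, 0.999)
# compliance=99.8500% target=99.90% budget_used=150.0%  -> budget blown; time to act
```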

What are your observability goals for 2022 and beyond?

LC: Teachers Pay Teachers’ general strategy for observability revolves around further building a culture of observability in our engineering organization. To do this, some specific goals include:

  • Continue our journey in making observable services easier to build. For us, this means expanding our engineering platform to provide tooling and integrations out of the box.
  • Strengthen the knowledge bases of our engineers through a more thorough onboarding process and documentation of what our standards are.

PK: I would like to explore and learn more about setting constructive goals on the path to observability, and understand what metrics would be useful to measure them. I would like to learn more about what it takes to make a system observable, and to distill those learnings into steps I can share across different teams and the wider community. I would also like to advocate for how QAs can add value while building observable systems.

KH: My observability goals don’t change much from year to year, as they are always centered on improvement. What does change, however, is where we look to make those improvements and the tools or tactics that are employed to measure and achieve success. For example, in 2022 I am keen on partnering with fellow engineering leaders to ensure that observability is integrated into each stage of the software development lifecycle. This may sound like a table-stakes requirement; however, it is paramount that observability be a constant consideration for organizations with ever-scaling infrastructure and growing complexity. If you think you have ‘arrived’ in terms of availability, shift your focus, fast forward six months, and see what happens.

LF-J: Observability is for everyone, not just tech-forward startups or unicorns, but also tiny startups, large established enterprises, and everyone in between. It’s for everyone because doing things the old way (traditional APM, monitoring, and practices) means doing it the hard way. Doing things the new way, with observability, is genuinely easier – it takes fewer engineers less time to understand and thoroughly debug issues, no matter how obscenely complex their systems.

So the overarching business priority is to make observability accessible to everyone, through design and features that make it easier to get started and share learnings across team members, through building our community of observability advocates, and through investing in our engineering team and practices as we ‘dogfood’ everything we offer to our customers.
