Site reliability engineers apply software techniques to operations to maximize uptime and avoid costly outages. But is this approach right for your organization?
Engineering leaders are often judged on the uptime of the systems and applications their teams build and maintain. It doesn’t matter how cool your feature set is: if your site or application goes down or the response time is unreasonably slow, the C-suite won’t be happy.
Traditionally, maintaining site uptime and performance was a job for the dedicated operations team, but recently those roles have been largely subsumed under the philosophical umbrella of DevOps.
Now, the modern discipline that keeps applications running and responsive is known as site reliability engineering (SRE), which applies the latest software development and automation techniques to the task of maintaining maximum uptime.
What is SRE?
SRE as a discipline emerged out of Google back in 2003. This shouldn’t come as a huge surprise: Google was one of the first companies for whom a rock-solid internet infrastructure was crucial for business, and the first to distribute that infrastructure over multiple data centers all over the world.
At its core, SRE is the practice of using software development tools and techniques to automate IT infrastructure tasks like application monitoring and system management. In the words of Google VP of Engineering Ben Treynor, who coined the term over a decade ago, SRE is “what happens when you ask a software engineer to design an operations function.”
SRE teams interface much more closely with developers than traditional IT operations did. They will help plan the rollout of new features or applications to ensure that site reliability isn’t sacrificed during the process.
The abbreviation SRE can refer to site reliability engineering as a discipline, or to an individual site reliability engineer tasked with putting it into practice. Like traditional IT ops staff, the SREs on your team will sometimes have to be on call so they can swing into action if a data center goes down or an application goes haywire. But when done right, SREs create policies and automation tools that will keep your system afloat without anyone having to get up in the middle of the night.
Google’s SRE team remains at the forefront of the discipline today – it employs thousands of people and keeps the internet’s most important website humming. Google’s team has also produced some great in-depth documentation on SRE best practices and strategies.
Key SRE principles: How does it work?
What does SRE look like in practice? Before you implement SRE in your own organization, let’s go over some of the key principles and how they work.
- Eliminate toil. In the SRE world, toil is any repetitive work that operations staff has to perform manually to keep systems running optimally, like running scripts or poking through logs. The specific kind of work we’re talking about here is tactical and reactive to problems – when you’re done with it, you’re back where you started before the problems arose and haven’t necessarily shifted things in a way that creates enduring value. Anyone with an ops background knows that such toil cannot be completely eliminated, but SRE shops aim to have engineers spend no more than half their time on it. To that end, implementing SRE practices requires you to automate many of these tasks, using infrastructure-as-code techniques and tools like Ansible to alleviate some of this toil. Only through extensive automation does site maintenance become scalable and sustainable.
- Monitor everything. Your automated tools need data, and to supply it you need monitoring tools throughout your infrastructure. In a world where infrastructure can be distributed across on-premises data centers and multiple public and private clouds, this can be easier said than done. You’ll need to consciously plan for observability as you build out your systems.
- Establish service-level objectives. In order to ensure that your infrastructure is reliable, you need to establish what reliable means to your organization. Part of your job as an engineering leader is to build a realistic consensus. Find out what all stakeholders need or expect and then determine what your technical team is capable of: what levels of uptime and application responsiveness do your customers and business users need, and what can your technical team deliver?
- Quantify and embrace risk. By agreeing on required levels of uptime, you’ve also defined acceptable amounts of downtime. This is an important aspect of the SRE philosophy as well: you can use that acceptable downtime to build an error budget that will allow you to roll out new features and other updates that might have temporary negative effects on system performance. In effect, your error budget is the breathing room your whole team will need to experiment and push the limits of your systems. (Google has an in-depth guide to formulating an error budget with more details.)
The ability of SRE teams to accomplish all these goals flows from that first point: eliminating toil. As an engineering leader, your job is to give your team space to do so.
Bootstrapping an SRE team can mean more work and resources to start, as your team will need to both maintain site uptime manually and build the automated tools that will take on much of that labor going forward. But once you reach a place where your SREs are spending no more than half their time putting out fires, they can spend the rest on more strategic tasks, like researching ways to improve reliability further and coordinating with developers on application and feature rollouts.
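To make the error budget idea from the principles above concrete, here is a minimal sketch in Python. It assumes a purely time-based availability target measured over a fixed window; the class and method names (ErrorBudget, allowed_downtime, remaining, can_take_risk) are hypothetical and invented for illustration, not taken from Google's guide or from any SRE tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ErrorBudget:
    """Time-based error budget derived from an availability SLO.

    All names here are illustrative assumptions, not a real tool's API.
    """

    slo: float             # agreed availability target, e.g. 0.999 for 99.9%
    window_minutes: float  # length of the measurement window in minutes

    def __post_init__(self) -> None:
        # Reject targets that leave no budget (1.0) or make no sense (<= 0).
        if not 0.0 < self.slo < 1.0:
            raise ValueError("slo must be strictly between 0 and 1")
        if self.window_minutes <= 0:
            raise ValueError("window_minutes must be positive")

    @property
    def allowed_downtime(self) -> float:
        """Minutes of downtime the SLO tolerates over the whole window."""
        return (1.0 - self.slo) * self.window_minutes

    def remaining(self, downtime_so_far: float) -> float:
        """Minutes of budget left; a negative value means the SLO is missed."""
        return self.allowed_downtime - downtime_so_far

    def can_take_risk(self, downtime_so_far: float, expected_cost: float) -> bool:
        """True if a change expected to cost `expected_cost` minutes still fits."""
        return self.remaining(downtime_so_far) >= expected_cost


if __name__ == "__main__":
    # A 99.9% target over a 30-day window.
    budget = ErrorBudget(slo=0.999, window_minutes=30 * 24 * 60)
    print(f"Allowed downtime: {budget.allowed_downtime:.1f} min")
    print(f"Remaining after a 20-minute outage: {budget.remaining(20):.1f} min")
    print(f"Room for a rollout that may cost 30 min? {budget.can_take_risk(20, 30)}")
```

Under these assumptions, a 99.9% target over 30 days leaves roughly 43 minutes of acceptable downtime. Once a 20-minute outage has used up part of that, a risky rollout expected to cost 30 minutes no longer fits, which is the kind of signal a team might use to slow releases until the window resets. Real teams often measure budgets in failed requests rather than minutes; this sketch only illustrates the arithmetic.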
SRE vs DevOps
As should hopefully be clear by now, SRE and DevOps don’t exist in opposition to one another. In fact, while SRE’s origins actually predate DevOps, the two dovetail nicely, to the extent that SRE can be considered an important component of best DevOps practice.
- By establishing service-level objectives and error budgets, SRE teams ease tensions between developers – who want to continuously roll new updates into production – and operations – who worry that such updates will disrupt currently smoothly running infrastructure.
- By applying software development techniques to operations – especially infrastructure-as-code, in which your infrastructure is defined by configuration files that are saved and worked on in code repositories like Git – SRE teams can integrate operations into the CI/CD pipelines used by developers.
Keep in mind that SRE isn’t the only player on the ops side of DevOps. Platform engineering represents a separate support team that designs and maintains the toolchains developers will use to write code and build applications. SREs, by contrast, help developers get that code running in production and make sure it stays within agreed parameters of reliability.
Advantages of SRE
Hopefully, by this point it should be clear how leading the charge to implement SRE can help your organization:
- You create a focus on customer expectations and user happiness. DevOps aims to improve the speed and quality of software engineering, and those are obviously good things. But SRE, in particular, is focused on keeping users happy by striving to understand what their uptime and performance needs are and then aligning your operations activities and development schedule to meet those needs. This is an important process for engineering leaders to spearhead; the truth is that your users may not fully understand what they need themselves, so quantifying those needs is of great value all around.
- You improve metrics and incident reporting. You can’t fix what you can’t measure. Automated tools need concrete metrics that they can understand and respond to. As an engineering leader, you may want to use an SRE push as an opportunity to improve your infrastructure’s observability.
- You empower team members to think strategically. If all your ops team does all day (and when they are on call) is fight to keep your site up and running, you’re going to suffer from rapid turnover due to burnout and disillusionment. By reducing toil and establishing that 50% of your SRE team’s time should be spent on strategic initiatives, you give your team members room to learn and grow, and feel like they’re moving forwards.
Downsides of SRE
It’s hard to argue in theory against any of the benefits SRE could bring to your organization. However, creating an SRE team and establishing the elaborate structure of automation required to underpin their work involves a significant investment of resources.
While some organizations have no choice but to put in that sort of investment, most should not delude themselves that they need to operate on the scale of a FAANG company and require the same level of site responsiveness as Google. It’s fully possible that your current operations practices support your actual needs. Part of your job as an engineering leader is to make that call honestly, rather than pursuing an SRE project because it’s a cool industry trend.
If you do decide to pursue SRE at your organization, one of the difficulties you’ll encounter is finding the right people to hire. A good site reliability engineer needs an in-demand combination of skills.
If you’re hiring for an SRE team – or looking to transform your current ops team into a true SRE powerhouse – here’s what you need to look for:
- Both dev and ops skills. An SRE needs both to understand infrastructure like an expert sysadmin and to write and understand code like a developer. Some SREs may start their career as programmers and others as sysadmins, but they should be comfortable with the other side and be willing to learn quickly. If you’re hiring internally, you may want to sniff out those team members who have a “jack of all trades” reputation among their peers. When it comes to development, you’ll particularly need the ability to work with and develop system automation and observability tools, but SREs also should be able to understand your product codebase and talk to developers about how best to deploy it.
- Communication skills. SREs will work with developers, platform engineers, and others to ensure smooth operations. They need to be able to empathize with other teams and synthesize competing demands in order to come up with error budgets that can accommodate everyone’s priorities. Yes, they sometimes need to be woken up in the middle of the night when something goes wrong, and to deliver bad news if that wrong can’t be righted quickly. All this requires the ability to communicate deftly and understand where people are coming from.
- A strategic vision. Remember, the goal for an SRE is that they should be spending half their time working on strategic projects to advance reliability goals. They should have the passion and foresight to see beyond your current architectural setup and envision ways it can be improved or transformed.
As an engineering leader, your job is to find the people who fit this bill – or who have the potential to grow into the role. Once your team is in place, you need to fight to get them the resources they need and help them coordinate with other teams to spread best practices. Implementing SRE can be a big investment and a major philosophical shift, but in the long run, it can pay big dividends in reliability and overall user happiness.