What is toil and why is it damaging your engineering org?

An introduction to toil and why it matters for SRE teams
May 16, 2022

The tech industry has always had localized expressions for work that was necessary but didn’t move the company forward.

‘Busy work.’ ‘Monkey work.’ ‘Muck work.’ ‘Chores.’ Now, thanks to the SRE movement, there is a word we can all use. That word is ‘toil.’

The concept of toil is a unifying force because it provides a way of identifying – and therefore containing – the work that takes up our time, blocks people from fulfilling their engineering potential, and doesn’t move the company forward.

Pager duty

Why does toil matter?

Not enough time and too much to do describes the default working conditions inside IT Operations. There’s an unlimited supply of planned and unplanned work – new things to roll out, incidents to respond to, support requests to answer, technical debt to pay down, and the list goes on.

With only so many hours in the day, how do we make sure what we’re working on actually makes a difference? How do we make sure our teams and our broader organizations are maximizing the kinds of work that add value, and finding ways to eliminate work that doesn’t?

To maximize both the value of your organization and the human potential of your colleagues, you need a framework to identify and contain the ‘wrong’ kind of work and maximize the ‘right’ kind of work. Understanding what toil is, and keeping the amount of toil contained, provides that framework. It benefits your company economically and improves the working lives of your fellow engineers. That’s a win-win situation.

Why are high levels of toil toxic?

Toil may seem innocuous in small amounts. Concern over individual incidents of toil is often dismissed with a response like ‘nothing wrong with a little busy work.’ However, when left unchecked, toil can quickly accumulate to levels that are toxic to both the individual and the organization.

For the individual, high levels of toil lead to:

  • Discontent and little sense of accomplishment
  • Burnout
  • More errors, leading to time-consuming rework to fix
  • No time to learn new skills
  • Career stagnation (hurt by a lack of opportunity to deliver value-adding projects)

For the organization, high levels of toil lead to:

  • Constant shortages of team capacity
  • Excessive operational support costs
  • Inability to make progress on strategic initiatives (the ‘everybody is busy, but nothing is getting done’ syndrome)
  • Inability to retain top talent (or to attract it once word gets out about how the organization functions)

One of the most dangerous aspects of toil is that eliminating it requires engineering work. Think about the last deluge of manual, repetitive tasks you experienced. Doing those tasks doesn’t prevent the next batch from appearing.

Reducing toil requires engineering time, either to build automation that removes the need for manual intervention or to enhance the system so that the intervention isn’t needed in the first place.

That engineering work typically takes one of three forms: creating external automation (i.e., scripts and automation tools outside of the service), creating internal automation (i.e., automation delivered as part of the service), or enhancing the service so that it doesn’t require maintenance intervention.

What should we be aiming for?

Working in an organization with a high ratio of engineering work to toil feels like everyone is swimming towards a goal. When there’s a low ratio of engineering work to toil, it feels like you’re treading water, at best, or sinking, at worst.

Instead of your people spending their time on non-value-adding toil, you want them to spend as much of their time as possible on value-adding engineering work.
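
One way to make that ratio concrete is to tag logged work as toil or engineering and compute the share of hours going to each. The following is a minimal sketch, not a prescribed method: the `WorkItem` type, the `is_toil` label, and the sample week are all hypothetical, and how a team classifies a given task is its own judgment call.

```python
from dataclasses import dataclass


@dataclass
class WorkItem:
    description: str
    hours: float
    is_toil: bool  # hypothetical label the team applies when logging work


def toil_share(items: list[WorkItem]) -> float:
    """Return the fraction of logged hours spent on toil (0.0 to 1.0)."""
    total = sum(item.hours for item in items)
    if total == 0:
        return 0.0  # nothing logged, so no toil to report
    toil = sum(item.hours for item in items if item.is_toil)
    return toil / total


# Illustrative data only: one engineer's hypothetical week.
week = [
    WorkItem("manual deployment", 6, True),
    WorkItem("DNS change request", 2, True),
    WorkItem("build deploy pipeline", 12, False),
]
print(f"toil share: {toil_share(week):.0%}")
```

Tracking this number over time matters more than any single reading: a rising share is the ‘treading water’ signal described above.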

A goal of ‘no toil’ sounds nice in theory. However, in reality, a ‘no toil’ goal isn’t attainable in an ongoing business. Technology organizations are always in flux, and new developments (expected or unexpected) will almost always cause toil. But just because a task is necessary to deliver value to a customer doesn’t mean that it’s value-adding work. For people who are familiar with Lean manufacturing principles, this is similar to Type 1 Muda (necessary, non-value-adding tasks).

Toil may be necessary at times, but it doesn’t add enduring value (i.e., a change in the perception of value by customers).

Toil comes from sources you already know about but haven’t had the time or budget to automate (e.g., semi-manual deployments, schema updates/rollbacks, changing storage quotas, network changes, user adds, adding capacity, DNS changes, service failover). It also comes from any number of unforeseen conditions that can cause incidents requiring manual intervention (e.g., restarts, diagnostics, performance checks, changing config settings).

Although we can’t get rid of toil altogether, we should learn to be effective at reducing it and keep it at a manageable level.

Reflections

Ironically, toil eats up the time needed to do the engineering work that will prevent future toil. If you aren’t careful, the level of toil can increase to a point where your organization doesn’t have the capacity needed to stop it. If we use the technical debt metaphor, this would be ‘engineering bankruptcy.’
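
The feedback loop can be illustrated with a toy simulation. This is a sketch under made-up assumptions, not a model of any real team: weekly capacity is fixed, a constant amount of new recurring toil arrives each week, and every hour of engineering work permanently removes a fixed fraction of an hour of weekly toil. The function name and all parameter values are hypothetical.

```python
def simulate(capacity: float, toil: float, new_toil: float,
             payoff: float, weeks: int) -> list[float]:
    """Return the weekly toil hours under a toy model.

    Each week, hours not consumed by toil go to engineering work, and
    every engineering hour removes `payoff` hours of recurring weekly
    toil. Meanwhile `new_toil` hours of fresh recurring toil arrive.
    All parameters are illustrative assumptions.
    """
    history = []
    for _ in range(weeks):
        history.append(toil)
        # Toil is served first; engineering gets whatever is left over.
        engineering = max(0.0, capacity - toil)
        toil = max(0.0, toil + new_toil - payoff * engineering)
    return history


# Same team, same inflow of new toil; only the starting level differs.
healthy = simulate(capacity=40, toil=10, new_toil=2, payoff=0.1, weeks=10)
sinking = simulate(capacity=40, toil=30, new_toil=2, payoff=0.1, weeks=10)
print([round(h, 1) for h in healthy])
print([round(h, 1) for h in sinking])
```

With these particular numbers, the team that starts with little toil drives it down, while the team that starts above the break-even point sees toil climb until it exceeds capacity and no engineering time remains to reverse it.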

The SRE model of working, and all of the benefits that come with it, depends on teams having ample capacity for engineering work. This capacity requirement is why toil is such a central concept for SRE. If toil eats up the capacity to do engineering work, the SRE model doesn’t work. An SRE perpetually buried under toil isn’t an SRE; they are just a traditional, long-suffering system administrator with a new title.