Three steps for managing toil as you scale

Reducing engineering toil as your company grows
April 27, 2022

Managing scale at a fast-growing startup is challenging enough, but managing the inevitable toil that arises as you scale adds to that challenge.

Toil is the operational burden of engineering. It’s the work that needs to get done to keep things running, but doesn’t make any new impact. As companies scale, toil can creep up not only in infrastructure and architecture, but also in processes. And as new pressure points are discovered, things can begin to break down.

Over the last few years working at Upside, I’ve had a front-row seat to how our company and our individual teams have scaled. And at every level of the business, the same pattern has played out in how we approach things: first we scale by adding people, then we standardize with process, then we automate with technology.

People, process, technology

Our three-step approach to scaling (‘people, process, technology’) has parallels with Simon Wardley’s ‘Pioneer, Settler, Town Planner’ model. A pioneer is exploring new worlds, blazing new trails, and occasionally failing. This mirrors early-stage growth via people and hiring. There’s not a lot of formal process or infrastructure, but there are people working hard and toiling away to accomplish their objectives.

As settlers move in, they turn half-baked trails into roads by which they can transport supplies into a village and start to build more permanent fixtures, map the territory, and plant fields. They start to standardize what the pioneers have done, building some common processes. At a startup, they might be defining routine ways of supporting the operations and toil of the business.

And finally, town planners fully automate what the settlers have done. Large-scale design and infrastructure come into play to support a booming population, and automation sets in with roads and stop lights. In a startup, tooling is built, automation and platform-ization begin, and then the cycle starts anew.

Here I’m going to walk you through our experience of each of these steps and share how we reduced toil along the way.

Stage 1: Scaling by adding people

Culture eats strategy for breakfast

…or so goes the saying often attributed to Peter Drucker. Hiring is an important piece of the culture puzzle, and every company wants to do it well. But that’s merely the tip of the iceberg. Culture isn’t just how you hire, but how you live and breathe your company’s values day in and day out.

As we were building out our transaction processing team, we were inheriting a codebase from another team, while hitting an inflection point and needing to support 10x growth. After hiring well, we also had to unleash the growing team on the problem and encourage relentless curiosity and creativity. We dedicated time to dig in to understand the system we inherited and went spelunking in not only our code, but other systems, diagrams, and designs so we could be confident in our implementations.

We also had to get creative; once we understood the system, we could modify and extend it to meet new scale and operational challenges. One of our key proposals was to create a dedicated ‘clean room’ pipeline, so we could reprocess historical data without affecting core system SLAs. Perhaps obvious in hindsight, but it wouldn’t have happened without buy-in from the team and product managers to reduce operational toil.

Shared foundations pay off in the long run

It’s easy to spin up new servers in the cloud, but not so easy to spin up new engineers. Hiring and onboarding are notoriously time-consuming but absolutely critical to the success of the team.

Each new hire is an opportunity to improve the process, and often teams will leave it to new team members to update documentation as one of their onboarding tasks. One of the key things we did early on in our data team was adopt the ‘handbook first’ approach, inspired by GitLab. Our onboarding for new hires, system documentation, how-tos, and FAQs are all in a central location, accessible to everyone. And curating that handbook is not just a new hire task, but the responsibility of everyone on the team.

Questions are answered with a link to the documentation, or occasionally with an update to the documentation followed by a link. This shared foundation and mental model is an expensive investment upfront, but pays off over time, with more self-service, faster onboarding, and less time reconstructing information.

If you want to go fast, go alone. If you want to go far, go together.

Engineering doesn’t exist in a vacuum. Product teams are often the closest partner team for engineering, and yet the relationship can sometimes be strained. Product may accept scope increases and shortcuts to optimize the feature set or time to market in order to deliver business value and hit goals or commitments to customers. Engineering may optimize for maintainability, elegant design, or performance, even at the expense of immediate customer or business value.

When not handled well, the interplay of these tradeoffs can lead to increased toil, maintenance, bugs, and even outages. These portraits of Product and Engineering are somewhat exaggerated, but they illustrate the tension between scale, speed, and (reduced) toil: you can only have two.

Bringing Product into the discussion early and building that empathy for the shared goals of the team is critical. Ask product managers to participate in the operations of the team so they understand the impacts of their proposals, and have engineers listen in to stakeholder calls so they can hear the needs of the customer firsthand.

Engineering teams should also build relationships with HR and recruiting teams. These folks are vital to help you scale, both through hiring and supporting individual growth. Growing the team will enable you to then grow your process and technology to help fight back against increasing complexity and toil.

Stage 2: Standardizing the process

You bail water, I’ll row the boat

As an individual contributor, focus time is a precious commodity, and as a leader, protecting that for the team is a high priority. On one team, we packed all of our meetings in on Tuesdays and the rest of the week was almost entirely meeting-less. On another team, we found our balance in ‘Focus Fridays,’ where other than an optional ‘coffee chat’ video conference, there were no other obligations.

Another tactic for achieving focus time for the team, especially when toil threatens to overrun a roadmap or sprint velocity, is to institute a ‘shield rotation.’ We designate one person per week or sprint who will take point on all inbound requests, whether for an incident, bug fixing, or just consultation. The person doesn’t need to solve everything, but has to at least triage and play ‘air traffic control.’ This is an expensive proposition, but every team I’ve been on with operational toil has appreciated the focus time it buys them when not on duty, and the added perspective it brings to the rest of their work when they are.
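
As a rough illustration, a weekly shield rotation can be generated with a simple round-robin. This is a minimal sketch, not a description of how any team at Upside actually schedules its rotation: the `shield_schedule` function, the placeholder names, and the one-week period are all assumptions made for the example.

```python
from datetime import date, timedelta
from itertools import cycle, islice


def shield_schedule(engineers, start, weeks):
    """Return (week_start, engineer) pairs, cycling through the team in order.

    Hypothetical helper: exactly one person holds the shield each week.
    """
    rotation = islice(cycle(engineers), weeks)
    return [(start + timedelta(weeks=i), name) for i, name in enumerate(rotation)]


if __name__ == "__main__":
    team = ["Ana", "Ben", "Chloe"]  # placeholder names, not real team members
    for week_start, name in shield_schedule(team, date(2022, 5, 2), 4):
        print(f"{week_start.isoformat()}: {name} is on shield")
```

A fixed order keeps the schedule predictable, so everyone knows well in advance which weeks are focus time and which are triage time.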

Identify and protect the right metrics

At the risk of falling victim to Goodhart’s Law (‘when a measure becomes a target, it ceases to be a good measure’), identifying and protecting the right metrics for the team becomes more important at scale.

From a process perspective, adopting an experiment-driven approach is essential. We can propose a new feature, venture a hypothesis, and design an implementation to prove or disprove that hypothesis. And the process needs to be backed by data, whether quantitative, qualitative, or both.

One of my favorite anecdotes was hearing an executive relate a story about a new product idea he had. He thought he was holding bottled lightning, and spent an entire day making customer phone calls to pitch the idea for feedback, only to be shot down by every single one. We can admire the tenacity, for sure, but the data- and customer-driven approach saved us months of effort.

There’s a balance, though: it’s also critical to ensure metrics aren’t sacred. As the team’s objectives are met and goals and roadmaps shift, metrics that used to make sense might no longer fit a changing landscape.

When we were building out our data platform, our earliest metrics were designed around supporting a previously-unfathomable scale: query volume, data ingested, number of tables and dashboards. As our internal user base increased, volume became less of a concern and correctness was more important. Keeping everyone on the same page in a growing and changing business, and providing a good user experience to increase adoption of the new data tools became our new goals, with requisite metrics like Net Promoter Score (NPS) and average query latency.
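
For readers unfamiliar with NPS, here is a minimal sketch of the standard calculation: respondents answer on a 0 to 10 scale, scores of 9 or 10 count as promoters, scores of 0 to 6 count as detractors, and the score is the percentage of promoters minus the percentage of detractors. The function name and the sample responses are made up for illustration; the article does not say how the team ran its surveys.

```python
def net_promoter_score(scores):
    """Compute NPS from a list of 0-10 survey responses.

    Promoters score 9-10 and detractors score 0-6. The result is the
    percentage of promoters minus the percentage of detractors, so it
    ranges from -100 to 100.
    """
    if not scores:
        raise ValueError("need at least one survey response")
    promoters = sum(1 for s in scores if s >= 9)
    detractors = sum(1 for s in scores if s <= 6)
    return 100.0 * (promoters - detractors) / len(scores)


# Made-up responses from internal users of a data tool.
responses = [10, 10, 9, 8]
print(net_promoter_score(responses))  # 3 promoters, 0 detractors out of 4 -> 75.0
```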

Narrow your strategic focus, let some fires burn, and monitor the costs

As a broader organization grows and the underlying infrastructure grows, teams will, by necessity, begin to specialize. They will need to narrow their focus and put more wood behind fewer arrows. At this stage, it becomes important to ruthlessly prioritize the roadmap for what is absolutely critical, and to be clear with stakeholders about what the team does and doesn’t do, and what their ‘API’ will be with the business.

In certain cases, despite our best efforts, teams may even need to let some fires burn. Strategically letting small fires burn allows you to focus on what will move the needle, and ‘earn the right’ to tackle the other problems later. It’s important to keep track of the fires we’re letting burn so they don’t get out of hand, and we also need to acknowledge the unseen fires of opportunity costs. Having to put out too many fires means there are other roadmap items that aren’t getting built while we’re putting them out.

Across teams, it can help to make unplanned work visible to stakeholders, and to give it some ‘cost’ for requesters. In one team, despite our best efforts, we had become somewhat of a ‘help desk’ organization and were absorbing work from other teams because we just wanted to be helpful. But it came at a cost as we weren’t making progress against our own goals and roadmaps. Our solution was to add just a little bit of friction to the intake process: to ask requesting teams to come with a more fully-fledged set of requirements, what had been tried, and a demonstration of some skin in the game. Oftentimes teams could solve their own problems after digging in a bit further, and when they needed additional help, the added context allowed for a deeper discussion of priority and urgency.

Stage 3: Automating with technology

Hardening and extending

At this point, parts of our architecture will be well-understood and will have a lower rate of change. These services can be hardened against scale and become core infrastructure that others depend on. At the same time, to maintain flexibility, we can create open APIs and platforms to allow others to build on top of them in new and experimental ways.

For core services and stabilized infrastructure, we’ve adopted ‘Rule zero – don’t be on fire.’ Tools like Datadog and PagerDuty have been instrumental in how we accomplish that on our teams. With metrics and deep observability of system health, we can identify unexpected or erroneous system behavior, proactively alert the team, and coordinate the remediation of any issues. This extends into our data warehouse and data products, where we use tools like Monte Carlo to monitor data flows and schemas. This enables us to proactively address data changes and anomalies before they affect operational decision-making.

Operational tooling

Now is also the time to prioritize building operational tooling. With standard processes in place, scripting and UIs can be built to reduce the cost of those processes and operations on the product and team. Work with your product partners to build ‘the product that runs the product.’

In each team at Upside, we start by collecting utilities and scripts into a single repo, then eventually wrap them into a CLI tool that becomes a ‘command center’ for a team member. Utilities that give insight into the system, allow for introspection, and automate common tasks all get bundled into a one-stop shop to ease the burden of toil.
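
To make the ‘command center’ idea concrete, here is a minimal sketch of how scattered scripts might be wrapped into one CLI using Python’s standard `argparse` subcommands. Everything specific here is hypothetical: the tool name `teamctl`, the `status` and `reprocess` subcommands, and their placeholder bodies are assumptions for illustration, not Upside’s actual tooling.

```python
import argparse


def cmd_status(args):
    # Placeholder: a real tool would query the team's services here.
    return f"status for {args.service}: (not implemented in this sketch)"


def cmd_reprocess(args):
    # Placeholder: a real tool would kick off a reprocessing job here.
    return f"would reprocess data from {args.start} to {args.end}"


def build_parser():
    parser = argparse.ArgumentParser(
        prog="teamctl", description="Team command center (sketch)"
    )
    subcommands = parser.add_subparsers(dest="command", required=True)

    status = subcommands.add_parser("status", help="show service health")
    status.add_argument("service")
    status.set_defaults(handler=cmd_status)

    reprocess = subcommands.add_parser("reprocess", help="reprocess historical data")
    reprocess.add_argument("--start", required=True)
    reprocess.add_argument("--end", required=True)
    reprocess.set_defaults(handler=cmd_reprocess)
    return parser


def main(argv=None):
    # Each subcommand registers its own handler, so adding a new utility
    # means adding one function and one add_parser() call.
    args = build_parser().parse_args(argv)
    return args.handler(args)
```

Calling `main(["status", "payments"])` dispatches to `cmd_status`, and passing no arguments makes `main()` read from the real command line. Because every utility sits behind the same entry point, `--help` doubles as a catalog of what the team has already automated.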

Conclusion

After this journey, we’ve scaled through people, process, and technology, and we have the foundations to repeat the process. We can now send out new pioneers to grow our product footprint or find new markets to enter. We’ll scale up with more settlers, institute standardized processes to achieve more scale, and finally, bring in the town planners to squash toil and build the next foundations for the next journey.
