Systems are more than just technical. They are designed, built, monitored, maintained – and sometimes broken – by humans. Systems are sociotechnical.
That’s why putting people at the center of processes is essential when creating and maintaining successful systems. In this article, I will share what a sociotechnical approach to system management looks like; why centering your engineers is important for closing the context gap and breaking down silos; and how you can get started in your organization.
There are a couple of questions I like to ask people working in tech: How many projects have you been involved in that failed, in your view, mainly because the technological choices were wrong? And how many can you count where you believe the failure instead came mainly from breakdowns in communication and shared understanding?
Usually, and for me personally, the answer to the first question can be counted on one hand, maybe two. The answer to the second, however, feels uncountable, which is why there is a third question I like to follow up with: Do you feel you’re spending most of your time focusing on the right one of these questions?
That third question has long gnawed at me. The more time I spend in this industry, the more interested I become in the social aspect of our systems. As a software developer, it’s tempting to frame the software as an independent system that you work on – you can make the system as correct, fault-tolerant, and robust as possible. If a problem comes up, it can be addressed in code, in hardware, or through automation. With enough time and effort, you can get things into a state where they work, and hopefully run themselves. But the flaw in this approach is that the world is a dynamic environment, and all our targets are shifting constantly.
We have to remember that the humans who write, operate, and influence the software are also part of the system, and we have to broaden the scope of what our system is in order to encompass people. A clear, visual explanation can be found in the STELLA report’s above-the-line/below-the-line framework. A framing of systems as being both social and technical (sociotechnical!) is pivotal. We don’t have to abandon efforts to make the technical aspects of our systems robust, but we have to be aware that their ongoing survival is a question of adaptation in the face of surprises.
Closing the context gap through humans
This is all related to something cognitive engineers call the context gap. In short, all the solutions and strategies we have for solving a problem are contextual: they depend on the environment and situation we’re in. People tend to be good at observing and being aware of that context, picking up a few relevant signals, and building a narrow solution that works for a restricted set of contexts. More importantly, we have the ability to notice when a solution no longer fits its context, and then to broaden the approach by modifying the solution.
On the other hand, automated components tend to start at the focused, narrow end. In practice, they do not have the ability to properly evaluate their environment and their own fitness, and neither are they good at ‘finding out’ that the signals they’re getting are no longer relevant to the overall system success. They therefore cannot really ‘know’ that they need to adjust. Closing the context gap requires the ability to evaluate and perceive things in ways our software systems can’t yet do as reliably – or nearly as effectively – as people. A good solution will require people in the loop if we want it to remain successful over time, because the world is dynamic and requires continuous adjustment.
The light switch example
Take this illustration as an example. The solutions we build (to the right) are a response to a simplified model (middle) we make of the real world (left). As a consequence, a solution is often bound by the model’s scope: it cannot reliably tell when its strategy no longer fits, nor adjust to the complexity of the real world. It is no surprise that almost all light switches with a motion sensor also come with manual control to allow humans to take over when the automated solution fails.
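To make the light switch a little more concrete, here is a rough sketch in Python. It is not taken from any real product; the class name, the timeout, and the method names are all made up for illustration. The automated rule only knows one signal (recent motion), so the design leaves a path for a person to take over when that signal stops matching reality.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Mode(Enum):
    AUTO = "auto"      # the motion-sensor rule decides
    MANUAL = "manual"  # a person has taken over


@dataclass
class LightController:
    """Hypothetical controller: a narrow automated rule plus a human override."""

    timeout_s: float = 300.0               # assumed: light stays on this long after motion
    mode: Mode = Mode.AUTO
    manual_state: bool = False
    last_motion_at: Optional[float] = None

    def on_motion(self, now: float) -> None:
        # The only signal the automated model has about the real world.
        self.last_motion_at = now

    def set_manual(self, on: bool) -> None:
        # A human noticed the model no longer fits and took control.
        self.mode = Mode.MANUAL
        self.manual_state = on

    def release_to_auto(self) -> None:
        # Hand control back to the automated rule.
        self.mode = Mode.AUTO

    def is_on(self, now: float) -> bool:
        if self.mode is Mode.MANUAL:
            return self.manual_state
        if self.last_motion_at is None:
            return False
        # The rule cannot tell "room is empty" apart from "someone is sitting still".
        return (now - self.last_motion_at) <= self.timeout_s


light = LightController(timeout_s=300.0)
light.on_motion(now=0.0)
on_while_moving = light.is_on(now=60.0)         # motion was recent
on_while_reading = light.is_on(now=600.0)       # person is still there, but not moving
light.set_manual(True)
on_after_override = light.is_on(now=600.0)      # the human closes the context gap
```

The controller never learns that its timeout rule has stopped being relevant. The person in the room does, and the manual mode is what lets them act on it.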
This underlines how important human mechanisms such as chaos engineering or blame-aware incident reviews can be – they help us to figure out where and how things break down, and allow us, as teams, to re-synchronize and update our individual mental models of the whole sociotechnical system. We can use these observability tools to continually improve our solutions and our organizations. They do not replace technical solutions, but they frame and guide the form our solutions should take, making them more effective, useful, and sustainable.
Observability is what makes a system interpretable and makes the work of extracting meaning from it easier. When our systems behave counter to our expectations, observable systems make it easier to figure out what is happening, and therefore to map out effective solutions.
Observability isn’t something systems have; it’s something they do. It should be used as a verb, and much like resilience and adaptive capacity, it tends to show up – and is most helpful – in critical, high-pressure situations.
If we aim to bridge the gap between dev and ops, to resolve errors more quickly, and to reduce the time spent firefighting, then enriching the relationship between the socio and the technical parts of our systems is likely to be a good investment.
Blamelessness is another important feature of a sociotechnical approach. It’s essential for building psychological safety for the folks involved. It has become more and more popular to talk about blamelessness, but many of us are still far from a great position. There is a tendency to interpret blameless as meaning no retribution – which is a nice and necessary starting point – but then end the process there. It’s a little funny when people say, ‘We do not want to blame anyone for this event,’ but otherwise still frame everything as a human error, with undertones that everything is someone’s fault and we’re just being nice by not naming them. This shows up as a general attitude in which any of the following can happen:
- Human error is called out as a cause that needs remediation
- People are seen as problematic around operations because they’re unpredictable
- Automation is perceived as necessarily good since it takes people out of the loop
- Folks have an attitude of, ‘next time, we’ll just be more careful and do a better job’
The above falls under shallow blamelessness – it avoids retribution, but still finds fault with individuals and frames most corrective work as controlling people. Better sociotechnical approaches see humans as a key factor of success, view their unpredictability as a sign of adaptive work taking place, and do not attempt to take them out of the loop but rather try to enhance their capabilities. Aligning the socio and the technical parts of the system requires more than awareness; it requires a different perspective altogether.
Although there are many vendors providing interesting tools that can be used to great effect, successful companies understand that tools aren’t enough. The human factors within a sociotechnical system are critical to our ability to create, and more importantly maintain, adaptive and successful solutions. Provide your people with rich views of their components through observability, and rely on their abilities and talents for understanding context. Mix them all together, and make sure learning is always a priority by maintaining psychological safety.