Reliability and more: preventing a crisis in engineering production

Leveraging tools, teams, and test-driven development for reliable releases.

By Ricardo Aravena

June 02, 2021

How many times have you considered the type of airplane you were on? How many times did you check that you were not flying on a 737 MAX? If you are like me, you probably checked every single time.

Two critical failures have turned Boeing and the aviation industry upside down in recent years. The first was the crash of Lion Air Flight 610 on October 29, 2018. The second was the crash of Ethiopian Airlines Flight 302 on March 10, 2019. The main reason for both crashes was a software defect in the Maneuvering Characteristics Augmentation System (MCAS) that caused both airplanes to nosedive while the pilots lost control.

You may be wondering what the MCAS has to do with your everyday software releases. The MCAS is software, just like the systems you may be responsible for, but because many lives are at stake when software controls critical aspects of flight, it has to go through rigorous quality controls.

The MCAS is a complex system. It receives data from the angle of attack sensors on the front of the plane (the angle of attack is the angle between the wing and the oncoming airflow). If the angle of attack is too high, the MCAS activates automatically to prevent the airplane from stalling: it takes control of the horizontal tail and pushes the airplane’s nose down. But what if the data from the angle of attack sensors is wrong? Can the MCAS consult other systems to detect that something is amiss and deactivate itself? The short answer is that, in these two accidents, it couldn’t.

The importance of test-driven development

Understanding how the parts and computers of a modern airplane work together looks like a daunting puzzle. You may think that your software doesn’t need such rigorous controls, but your end users or customers will most likely notice when you ship unreliable software, and their frustration can do long-term damage to your organization’s reputation. When the 737 MAX MCAS failed to deactivate itself, it was analogous to a production microservice failing to trip its circuit breaker.
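For readers who have not met the pattern, here is a minimal sketch of a circuit breaker in Python. It is an illustration under stated assumptions, not a production library: the class name, the `max_failures` and `reset_timeout` parameters, and their default values are all hypothetical. After a run of consecutive failures the breaker "opens" and rejects calls immediately, then allows a single trial call once the timeout has elapsed.

```python
import time


class CircuitOpenError(RuntimeError):
    """Raised when a call is rejected because the breaker is open."""


class CircuitBreaker:
    """Toy circuit breaker; names and defaults are illustrative assumptions."""

    def __init__(self, max_failures=3, reset_timeout=30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_timeout = reset_timeout
        self._clock = clock          # injectable so tests can control time
        self._failures = 0
        self._opened_at = None       # None means the breaker is closed

    def call(self, func, *args, **kwargs):
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_timeout:
                # Still open: fail fast instead of hitting the sick dependency.
                raise CircuitOpenError("dependency marked unhealthy")
            # Timeout elapsed: fall through and allow one trial call.
        try:
            result = func(*args, **kwargs)
        except Exception:
            self._failures += 1
            if self._failures >= self.max_failures:
                # Open the breaker, or re-open it after a failed trial call.
                self._opened_at = self._clock()
            raise
        # Any success closes the breaker and clears the failure count.
        self._failures = 0
        self._opened_at = None
        return result
```

A caller would wrap each request to a downstream service, for example `breaker.call(fetch_profile, user_id)`, where `fetch_profile` stands in for whatever client function your service uses.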

Why wasn’t there a failsafe mechanism to deactivate the MCAS? The most obvious answer is that Boeing failed to come up with enough test cases for the MCAS. On top of that, there were probably internal political battles at play. For example, how much pressure was there to get the 737 MAX out the door as soon as possible, skipping much-needed quality protocols?
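To make "enough test cases" concrete, here is a toy sketch of the kind of test a test-first approach would force you to write. It is in no way Boeing's actual logic: the function `should_engage`, the two-sensor comparison, and both threshold values are assumptions made purely for illustration. The point is that the awkward cases (sensors that disagree, readings that are not even numbers) are written down as tests alongside the happy path.

```python
import math
import unittest

STALL_THRESHOLD_DEG = 15.0    # hypothetical value, for illustration only
MAX_DISAGREEMENT_DEG = 5.0    # hypothetical value, for illustration only


def should_engage(left_deg, right_deg):
    """Return True only when both readings are valid, agree, and are too high."""
    if not (math.isfinite(left_deg) and math.isfinite(right_deg)):
        return False  # a missing or garbage reading: stay off
    if abs(left_deg - right_deg) > MAX_DISAGREEMENT_DEG:
        return False  # sensors disagree: fail safe and stay off
    return min(left_deg, right_deg) > STALL_THRESHOLD_DEG


class ShouldEngageTest(unittest.TestCase):
    def test_stays_off_in_normal_conditions(self):
        self.assertFalse(should_engage(4.0, 4.5))

    def test_engages_when_both_sensors_report_high_angle(self):
        self.assertTrue(should_engage(18.0, 17.0))

    def test_stays_off_when_sensors_disagree(self):
        # One faulty sensor reports an extreme angle; the other reads normal.
        self.assertFalse(should_engage(40.0, 4.0))

    def test_stays_off_when_a_reading_is_not_a_number(self):
        self.assertFalse(should_engage(18.0, float("nan")))
```

Saved in a file, the tests can be run with `python -m unittest`. In a test-driven workflow the last two tests would exist before the function body does, which is exactly what makes the missing failsafe hard to overlook.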

Learning from Boeing’s experience, we can conclude that lacking processes like test-driven development, quality assurance (QA), and active monitoring can lead to catastrophic results that threaten an organization’s existence. But what about agility? The counterargument is that the faster you ship, the more likely your customers are to experience outages and bugs. That’s where tooling and team structure can help.

If you look at a modern microservices architecture, there is a frontend (or several frontends), a set of interconnected microservices, and some sort of backend, such as a database, where the state is kept.

Like the 737 MAX components, each of these software components needs to be tested individually and also as a set, together with all of its dependencies. A common methodology that might help (and that several organizations are adopting nowadays) is the tribe/squad/guild model pioneered at Spotify.

Structuring your team

We can create multiple tribes around the business objectives, and within each tribe, individual squads would own the microservices. Each independent squad contains all the members needed to be successful with a given microservice: a lead, frontend and backend developers, SRE and QA engineers, and a scrum master. Furthermore, each squad is responsible for managing its own public cloud provider accounts, which allows it to be agile in its own way. Finally, each squad would also handle its own chaos engineering tests and incidents when they happen.

Guilds, formed by members of multiple tribes and squads, would establish standards. Those standards would include how microservices talk to each other, what tooling best fits the organization, and the timeout for the circuit breakers between microservices. They would also cover how many code review approvals are needed before a release, what service scaffolding to use when creating a new microservice, the preferred programming languages, and how squads manage their incidents.
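One way to keep such standards from living only in a wiki is to write them down as code that every squad's pipeline can check. The sketch below is a minimal, hypothetical example: the `GuildStandards` fields, the default values, and the `release_violations` helper are all assumptions for illustration, not a recommendation of specific numbers or languages.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GuildStandards:
    """Organization-wide defaults agreed by a guild (all values hypothetical)."""
    circuit_breaker_timeout_s: float = 2.0
    required_review_approvals: int = 2
    preferred_languages: tuple = ("Go", "Python")


def release_violations(approvals, language, standards):
    """Return the reasons, if any, that a release breaks the guild standards."""
    problems = []
    if approvals < standards.required_review_approvals:
        problems.append(
            f"needs {standards.required_review_approvals} approvals, has {approvals}"
        )
    if language not in standards.preferred_languages:
        problems.append(f"{language} is not a preferred language")
    return problems
```

A squad's release job could call `release_violations(approvals, language, GuildStandards())` and block the release whenever the returned list is not empty.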

Leveraging the tools available to you

What about tooling? We are lucky to keep seeing an incredible amount of tool development in the open source community. The first open source project that comes to mind is Kubernetes, which allows you to run redundant and reliable workloads. But there is also a myriad of open source projects addressing the other areas of the cloud native ecosystem, such as security, observability, and storage. Each can be used like a Lego building block to provide reliable infrastructure and applications that suit your needs.

Furthermore, over the last ten-plus years, since the advent of Amazon Web Services (AWS), we have seen an explosion of cloud and DevOps tools. Many of them address specific pain points of managing complex microservice architectures. This lets an organization strike a balance between building in-house, leveraging open source, and paying a vendor that can solve those pain points faster.

Suddenly, what seemed like an unsolvable puzzle with many interconnected dependencies becomes more manageable, with more regular and reliable releases. We can organize around small agile teams and use tooling from cloud native vendors or open source projects. Over the long term, all this will help your organization produce more reliable software, and it could figuratively save your software parking lot from looking like the parking lot next to the Boeing factory.

About the author

Ricardo Aravena
- @raravena80
- raravena80
- raravena
- Blog
