What can engineering leaders learn from one of the biggest IT disasters in history?
On July 18, the popular endpoint security provider CrowdStrike released a software update that crashed an estimated 8.5 million Windows devices. Although that’s less than 1% of all Windows machines, the impact was much broader: planes were grounded, card payments were disabled, and hospitals had to rearrange appointments.
While we all wait for the official post-mortem to be released, here are five lessons all engineering leaders can learn from this incident.
1. Make progressive delivery a priority
Microsoft’s official statement said this incident reminds us “how important it is for all of us across the tech ecosystem to prioritize operating with safe deployment and disaster recovery using the mechanisms that exist.”
In part, that means no release should go out without rollbacks, feature flags, blue/green deployments, A/B or canary testing, and other progressive delivery methods in place to control the blast radius when things go wrong.
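To make the idea of controlling the blast radius concrete, here is a minimal sketch of a staged (canary-style) rollout with automatic rollback. It is illustrative only and is not a description of how CrowdStrike or any other vendor ships updates. The stage percentages, the 2% failure threshold, and the `deploy`, `is_healthy`, and `rollback` callbacks are all hypothetical names chosen for this example.

```python
import hashlib

# Hypothetical stages: cumulative percentage of the fleet that has the update.
STAGES_PERCENT = [1, 10, 50, 100]
# Assumed threshold: halt if more than 2% of updated devices look unhealthy.
MAX_FAILURE_RATE = 0.02


def bucket_for(device_id: str) -> int:
    """Map a device ID to a stable bucket in the range 0-99."""
    digest = hashlib.sha256(device_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % 100


def staged_rollout(devices, deploy, is_healthy, rollback) -> bool:
    """Deploy stage by stage; roll everything back if a stage looks unhealthy."""
    updated = []
    for percent in STAGES_PERCENT:
        already_updated = set(updated)
        # Devices whose bucket falls under this stage and that are not yet updated.
        cohort = [
            d for d in devices
            if bucket_for(d) < percent and d not in already_updated
        ]
        for device in cohort:
            deploy(device)
        updated.extend(cohort)

        # Check the health of everything updated so far before widening the rollout.
        failures = sum(1 for d in updated if not is_healthy(d))
        if updated and failures / len(updated) > MAX_FAILURE_RATE:
            for device in updated:
                rollback(device)
            return False
    return True
```

Because the bucket is derived from a hash of the device ID, the same devices land in the early stages on every run, which is one simple way to keep a canary group stable. In a real pipeline, the health check would typically wait for telemetry from each stage before moving on; that waiting period is left out here.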
Andrés Vega, founder of Messier 42, identified a set of core “robust release processes” in a CNCF blog post in the aftermath of the incident, including:
- Comprehensive testing
- Integrity verification
- Staged rollouts
- Quick rollback mechanisms
- Transparent communication
- Proper encryption and recovery key management
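Of the items above, integrity verification is the easiest to show in a few lines. The sketch below assumes the publisher distributes a SHA-256 digest alongside each update; the function names are hypothetical and it does not represent any specific vendor's mechanism.

```python
import hashlib
import hmac


def sha256_of(payload: bytes) -> str:
    """Return the hex-encoded SHA-256 digest of an update payload."""
    return hashlib.sha256(payload).hexdigest()


def verify_artifact(payload: bytes, expected_sha256: str) -> bool:
    """Check a downloaded payload against the digest the publisher announced."""
    # compare_digest avoids leaking, via timing, how much of the digest matched.
    return hmac.compare_digest(sha256_of(payload), expected_sha256.lower())
```

A matching digest only confirms that the artifact is the one that was published, not that its contents are safe to run, which is why the check sits alongside testing and staged rollouts rather than replacing them.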
While this incident took place on a Thursday, “If your response is ‘don't deploy on Friday,’ then it says more about the poor state of your systems and that they cannot be trusted, than it does about CrowdStrike's processes,” Jason Yee, staff advocate at Datadog, posted on LinkedIn. “Everyone in tech loves to do some armchair quarterbacking when it comes to major incidents, but the many ‘don’t deploy on Friday’ takes really expose how out-of-date some folks are with modern reliability practices.”
One of the best ways to minimize blast radius is to dogfood your own releases. As Gergely Orosz asks in his newsletter, The Pragmatic Engineer:
- Was this change rolled out to CrowdStrike staff, before release to the public?
- If yes, did some CrowdStrike employees also see their operating system crash?
- If yes, then why did the rollout proceed?
- If there was dogfooding, but no employees’ machines crashed; an interesting question is: why not?
Dogfooding with progressive delivery will make your release process more resilient, but it also requires a blameless culture where everyone feels safe to speak up if they see something wrong.
2. Always have a backup plan
You need both a business continuity and a disaster recovery plan in place. Otherwise you’re putting your service level agreements – and your whole business – at risk. If a cloud provider goes down, a ransomware attack happens, or your data is suddenly corrupted, hacked or deleted, you need to make sure you have redundancy, data backups, and an analog plan of action.
Yes, this was a technical defect, but across the many companies affected, the lack of a backup plan was glaring. Whether you are a cash-free cafe, a multinational airline, or a hospital, you need to have devised a plan for a sudden shift to analog and trained your employees on it. At the very least, you may need key records printed out and a lockable cash box so you can continue to operate when systems go down.
3. Ask who you are outsourcing security to
The recent hacking of third-party pathology provider Synnovis, which led many London hospitals to cancel all non-emergency appointments, was another good reminder that you can’t just rely on your own backup plans. In response, a forthcoming cybersecurity and resilience bill will require third-party services provided to the UK government to strengthen their cybersecurity, although details of exactly how are currently thin on the ground.
You shouldn’t wait for the government to intervene, however. Regularly review the security of your third-party integrations and make sure you have a backup plan for each provider going down. Avoid single points of failure, or at least make sure you know where they are.
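One lightweight way to find out where your single points of failure are is to keep an inventory of services and the vendors behind them, then flag anything with only one provider. The sketch below is a hypothetical illustration: the `Service` fields, the function names, and the vendor names in the usage example are all invented.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Service:
    name: str
    providers: list[str]                 # vendors that can each deliver this service
    fallback_plan: Optional[str] = None  # documented manual or analog fallback, if any


def single_points_of_failure(services):
    """Map each sole vendor to the services that stop if that vendor goes down."""
    exposure = {}
    for service in services:
        if len(service.providers) == 1:
            exposure.setdefault(service.providers[0], []).append(service)
    return exposure


def missing_fallbacks(services):
    """List single-vendor services that have no documented fallback plan."""
    return [
        service.name
        for group in single_points_of_failure(services).values()
        for service in group
        if service.fallback_plan is None
    ]


# Usage with made-up data:
inventory = [
    Service("card payments", ["VendorA"], fallback_plan="cash box and paper receipts"),
    Service("endpoint security", ["VendorB"]),
    Service("email", ["VendorC", "VendorD"]),
]
exposure = single_points_of_failure(inventory)
unprotected = missing_fallbacks(inventory)
```

Here `exposure` groups the card payment and endpoint security services under their sole vendors, and `unprotected` contains only the endpoint security service, because it has neither a second provider nor a written fallback.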
“This incident demonstrates the interconnected nature of our broad ecosystem – global cloud providers, software platforms, security vendors and other software vendors, and customers,” Microsoft said in its statement.
Organizations are habitually underinvesting in IT, blindly trusting third-party vendors as the magic way to stay compliant. “That's why an update can be pushed directly through production systems without anyone in the bank/hospital/airline testing it out first. The mix of tech monopoly, regulatory capture and some broken code resulted in the perfect storm, taking out global infrastructure,” open source consultant Tracy Miranda posted on LinkedIn.
And, as Orosz noted, this incident is an important reminder that your software can be broken not just by your code, but by your dependencies and vendors.
Jan Kammerath, Head of IT at Dertour DMC Network, argued on his blog that endpoint protection software is inherently dangerous because it has to be given privileged access to the operating system to be effective. “Given the fact that the CrowdStrike driver (they call it a ‘sensor’) is so deeply nested into Windows and also bypassed safeguards of Windows, it took out the entire operating system like we’re back in 1994,” he wrote.
4. Consider employees as stakeholders too
Building a blameless culture that can withstand this type of crisis starts at the top. A defect of this magnitude being pushed into production without the proper checks in place could be a signal that the organization prioritizes speed over quality, reliability and security.
While several anonymous employee reviews on Glassdoor boast of a positive rapport with colleagues, great pay, and access to cutting-edge technology, the number one con for CrowdStrike is poor management, with some reviews specifically calling out the CEO, George Kurtz. The official CrowdStrike account replies only by asking employees to de-anonymize themselves by emailing HR.
In his apology letter, Kurtz addresses “valued customers and partners,” but does not acknowledge his staff for their efforts to mitigate the problem.
5. Be kind
As Yee pointed out, engineering is a field rife with backseat drivers. Unless you have all of the context, think before you post about how you would have done things differently.
“Blaming software engineers is nothing more than satisfying the bloodthirsty public for your organizational malpractices,” wrote senior software engineer Dmitry Kudryavtsev on his blog. “You won’t solve the root cause of the problem – which is a broken pipeline of regulations by people who have no idea what are they talking about, to CEOs who are accountable only to the board of directors, to upper and middle management who thinks they know better and gives zero respect to the people who actually do the work, while most of the latter just want to work in a stable environment where they are respected for their craft.”
Engineering is hard, and we should all be in it together. Don’t be quick to pass blame, learn from others, and when things go wrong, #HugOps.