CrowdStrike Incident - Lessons Learned and Discourse Initiated
On July 19, 2024, CrowdStrike released a sensor configuration update for Windows systems that triggered a logic error, causing affected machines to crash with the blue screen of death (BSOD). The incident disrupted infrastructure worldwide in industries as critical as healthcare, aviation, and finance, and it had a direct impact on human lives: patients were turned away from emergency rooms, airline flights were canceled, and money could not be withdrawn from ATMs.
If there is a bright side, it is that this single incident has initiated serious discourse on how to prevent and mitigate such catastrophes in the future.
This blog post discusses the wider impact the incident has had on society, across policy, finance, and culture, as well as its deeper implications for different areas of technology.
Technology
Infrastructure
The incident has raised questions about the design of the systems that form the backbone of the applications powering critical services.
In this case, the CrowdStrike sensor is an agent running on the Windows host, and it is worth asking whether a better design is possible. Why should an entire host stop working when one of its agents misbehaves? And does endpoint detection and response (EDR) really require kernel and driver-level access?
The tweet below from Mark Russinovich (@markrussinovich) advocates using Rust for more reliable, less error-prone code in critical infrastructure where a non-garbage-collected language is required.
I thought of tweeting this again today for no particular reason… https://t.co/veTbccM7aT
— Mark Russinovich (@markrussinovich) July 21, 2024
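As a minimal, hypothetical sketch (not CrowdStrike's actual code), the snippet below illustrates the point: reading an out-of-range field from untrusted update content is undefined behavior in C-style code and can take down the whole host, whereas safe Rust forces the out-of-range case to be handled explicitly.

```rust
// Hypothetical sketch: reading a field from a parsed content update.
// The field index comes from external data, so it cannot be trusted.
fn read_param(params: &[u32], index: usize) -> Option<u32> {
    // `get` returns None instead of reading out of bounds; the equivalent
    // unchecked pointer arithmetic in C would be undefined behavior.
    params.get(index).copied()
}

fn main() {
    let params = vec![10, 20, 30];

    match read_param(&params, 7) {
        Some(value) => println!("param = {value}"),
        // The bad input is rejected and reported instead of crashing the host.
        None => eprintln!("malformed update: field index out of range"),
    }
}
```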
Code Reviews, Tests and Deployment
Although this particular bug most likely would not have been caught through code review, the incident has reinforced the need for code review as part of any software development process.
Having a rigorous quality assurance and release process is equally important. Every change, including configuration, should go through the same process. Companies should maintain an environment that mirrors production to test changes before they are rolled out, and changes should be rolled out in phases so that bugs are contained and mitigated early, before they can have a larger impact.
It is also important to have a backout plan so that a change can be rolled back during a production incident such as this one without having to deploy a new patch.
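A minimal sketch of what such a phased rollout with an automatic backout could look like is below. The ring names, sizes, health check, and rollback steps are hypothetical placeholders, not any particular vendor's deployment pipeline.

```rust
// Hypothetical sketch of a phased (ring-based) rollout with automatic rollback.
struct Ring {
    name: &'static str,
    hosts: u32,
}

fn deploy_to(ring: &Ring) {
    println!("deploying update to {} ({} hosts)", ring.name, ring.hosts);
}

fn health_check(ring: &Ring) -> bool {
    // In practice: crash rates, boot loops, telemetry from the canary hosts.
    println!("checking health of {}", ring.name);
    true
}

fn rollback(rings: &[Ring]) {
    // Revert to the last known-good configuration without shipping a new patch.
    for ring in rings {
        println!("rolling back {}", ring.name);
    }
}

fn main() {
    let rings = [
        Ring { name: "internal canary", hosts: 100 },
        Ring { name: "early adopters", hosts: 10_000 },
        Ring { name: "general availability", hosts: 1_000_000 },
    ];

    for (i, ring) in rings.iter().enumerate() {
        deploy_to(ring);
        if !health_check(ring) {
            // A bad change is contained to the rings it has already reached.
            rollback(&rings[..=i]);
            return;
        }
    }
    println!("rollout complete");
}
```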
Culture
The cultural discourse this incident has raised is about understanding the impact a small bug like this can have on human lives, and about investing in core infrastructure and processes accordingly.
The software industry is obsessed with “move fast and break things,” but as more and more critical parts of our day-to-day lives come to depend on software, it has become clear that this is not sustainable.
At the company level, the incident has underscored the importance of investing in making current systems more reliable, for example by regularly running chaos engineering (Chaos Monkey style) and disaster recovery (DR) exercises.
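As a small, hypothetical illustration of the fault-injection mindset behind such exercises, the test below forces a dependency failure and asserts that the caller degrades gracefully instead of crashing; the service and dependency here are made up for the sketch.

```rust
// Hypothetical chaos-style test: inject a failure into a dependency and
// check that the caller degrades gracefully instead of crashing.
#[derive(Clone, Copy)]
enum Dependency {
    Healthy,
    Down, // injected fault
}

fn fetch_config(dep: Dependency) -> Result<String, String> {
    match dep {
        Dependency::Healthy => Ok("fresh config".to_string()),
        Dependency::Down => Err("config service unreachable".to_string()),
    }
}

// The service falls back to a last-known-good value when the dependency fails.
fn load_config(dep: Dependency) -> String {
    fetch_config(dep).unwrap_or_else(|_| "last known good config".to_string())
}

fn main() {
    // Normal path.
    assert_eq!(load_config(Dependency::Healthy), "fresh config");
    // Fault injection: the same path with the dependency forced down.
    assert_eq!(load_config(Dependency::Down), "last known good config");
    println!("service degraded gracefully under injected failure");
}
```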
Policy
The incident has prompted several questions from elected officials and those in government offices.
The Chair of the Federal Trade Commission (FTC) has raised the concern that concentration can create fragile systems in society:
1. All too often these days, a single glitch results in a system-wide outage, affecting industries from healthcare and airlines to banks and auto-dealers. Millions of people and businesses pay the price.
These incidents reveal how concentration can create fragile systems.
— Lina Khan (@linakhanFTC) July 19, 2024
AI-Generated Code
There is a rumor that the code or configuration change that created this havoc was generated by AI. Regardless of whether that is true, this is another area that deserves wider discourse and research: building the right guardrails and tools to verify code generated by large language models (LLMs), similar to what TLA+ provides for distributed systems.
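As a hedged sketch of what one such guardrail could look like, the snippet below validates a hypothetical machine-generated configuration against simple invariants before it is accepted for deployment; the field counts and rules are illustrative, not any real tool's checks.

```rust
// Hypothetical guardrail: validate machine-generated configuration before
// it is accepted. The expected field count and ranges are illustrative.
struct GeneratedConfig {
    fields: Vec<i64>,
}

fn validate(config: &GeneratedConfig, expected_fields: usize) -> Result<(), String> {
    if config.fields.len() != expected_fields {
        return Err(format!(
            "expected {} fields, got {}",
            expected_fields,
            config.fields.len()
        ));
    }
    if config.fields.iter().any(|f| *f < 0) {
        return Err("fields must be non-negative".to_string());
    }
    Ok(())
}

fn main() {
    // e.g. generated content carries fewer fields than the consumer expects.
    let generated = GeneratedConfig { fields: vec![1; 20] };

    match validate(&generated, 21) {
        Ok(()) => println!("config accepted for phased rollout"),
        Err(reason) => eprintln!("config rejected before deployment: {reason}"),
    }
}
```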
Finance
The incident led to a dip of almost 20% in CrowdStrike's stock price, along with indirect financial losses such as a dent in the company's brand and trust. Time will tell how long it takes for the stock price to bounce back and for the company to recover that trust.
Figure: Stock Price of CrowdStrike on July 21, 2024
A tweet from Dan Luu (@danluu) running a poll on Twitter to gauge the expected impact of the incident:
Will Crowdstrike bricking all these computers have a significant long-term impact on the stock price?
Or will this be like basically every other outage or breach, where "the market" quickly realizes that customers don't care and regulatory action isn't forthcoming?
— Dan Luu (@danluu) July 19, 2024
There were also larger financial losses for the companies that could not conduct business because of the incident, as well as for customers who now have to struggle to get their money back for products and services they could not use.
Conclusion
The CrowdStrike incident has taught us that it does not take a large change in software or infrastructure to create a catastrophe that directly affects human lives, given how much our day-to-day lives now depend on software. That dependence will only grow as we adopt more AI and LLMs in the future. Now is the time for everyone who builds and uses such technologies to reflect on how they can make their systems more reliable and fault tolerant.
Software is no longer a separate piece of code running on its own; it has become a critical part of our day-to-day lives.
References
https://www.crowdstrike.com/blog/falcon-update-for-windows-hosts-technical-details/