CrowdStrike Aftermath: Preparing for the Next Major IT Outage
Many of us remember the Y2K scare: the fear that computers worldwide would shut down on 1/1/00, sending society back to the mid-20th century.
Thankfully, that catastrophe never happened. However, on Friday, July 19, 2024, we experienced something eerily reminiscent of those concerns: the most significant IT outage in history.
The CrowdStrike outage/incident/disaster—whatever you want to call it—wasn’t a cyberattack or a Microsoft failure, but a logic error in an update from CrowdStrike that caused 8.5 million Windows devices to crash and experience the Blue Screen of Death (BSOD).
This incident underscores a crucial point: while you can’t foresee every possible business disruption, you can—and must—prepare for them.
Your focus should always be on resiliency. Maintaining business continuity amid disruptions and recovering as quickly as possible after an incident is the difference between keeping your business doors open and shutting them forever.
Your board, your clients, and all your stakeholders will soon be asking questions—if they’re not already—about how this happened and how you can prevent it from happening again.
What can you do today—and in the coming days, weeks, and months—to become more resilient and better prepared for the next incident?
Anatomy of the Outage
Feel free to skip this section and get right to the steps you should take. But for those who want a little more information about why this outage happened, read on.
The CrowdStrike outage wasn't a security breach or a Microsoft Windows flaw. Instead, it was a CrowdStrike software error, delivered via an automatic update, in a component that runs deep enough inside Windows that its failure caused a catastrophic crash across millions of computers.
This specific piece of CrowdStrike software has high privileges on Windows machines, allowing it to monitor operations across the operating system in real time. A logic flaw in a sensor configuration update caused CrowdStrike’s Falcon sensor to crash, and because the sensor runs with such high privileges, Windows crashed along with it.
Nearly every industry sector was affected: airlines, public transit, healthcare, financial services, media and broadcasting, and many other organizations across the globe, including a large percentage of small businesses in the US.
#1: Conduct a Business Impact Analysis (BIA)
A Business Impact Analysis helps identify your critical business functions and the resources supporting them. It assesses the effects of disruptions on your operations, whether from cyberattacks, natural disasters, or software errors like the CrowdStrike incident.
A BIA helps you make informed risk management decisions based on the likelihood and impact of any given risk.
It’s hard to imagine that the decision-makers at Delta Air Lines fully grasped the lasting impact the CrowdStrike outage would have on their business. Delta continued to scrap hundreds of flights for a week following the incident, canceling more flights in that timeframe than it did in the previous two-plus years.
A business impact analysis assigns actual costs to each risk, guiding the creation of plans and policies that allow you to prepare accordingly.
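The likelihood-and-impact reasoning above can be made concrete with a simple expected-loss estimate. The sketch below is one common way to do it, not a prescribed method. The risk names, probabilities, and dollar figures are made-up placeholders, not data from the CrowdStrike incident.

```python
from dataclasses import dataclass


@dataclass
class Risk:
    name: str
    annual_likelihood: float  # assumed chance of occurring in a given year, 0.0 to 1.0
    impact_usd: float         # assumed total cost if it does occur


def expected_annual_loss(risk: Risk) -> float:
    """Likelihood times impact: a rough yearly exposure figure for one risk."""
    return risk.annual_likelihood * risk.impact_usd


# Hypothetical entries; replace them with figures from your own analysis.
risks = [
    Risk("Faulty vendor software update", 0.10, 500_000),
    Risk("Ransomware attack", 0.05, 2_000_000),
    Risk("Regional power outage", 0.20, 50_000),
]

# Highest exposure first, so the costliest risks get attention first.
ranked = sorted(risks, key=expected_annual_loss, reverse=True)

for risk in ranked:
    print(f"{risk.name}: ${expected_annual_loss(risk):,.0f} per year")
```

A single number like this hides a lot of uncertainty, so treat it as a way to compare risks against each other rather than as a forecast.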
- Identify Critical Functions: Determine which business functions are essential for your operations.
- Assess Impact: Evaluate the financial, operational, and reputational impact of each disruption. Account for timing that could amplify the impact, and be sure to consider dependencies within your organization. Identify where problems will start cascading to other areas, ramping up the business interruptions and costs.
- Set Priorities: Rank the importance of each function to prioritize recovery efforts. Start by calculating the costs of the various disruptions on your list. Knowing the costs will help you establish recovery time objectives (RTOs, how quickly a function must be restored) and recovery point objectives (RPOs, how much data loss you can tolerate) in each risk area.
- Develop Strategies: Create strategies to mitigate risks and ensure continuity.
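The steps above, especially the points about cascading dependencies and costing out disruptions, can be sketched in a few lines of code. This is a minimal illustration under stated assumptions: the business functions, hourly costs, and dependency links are all hypothetical, and a real BIA would draw them from interviews and financial data.

```python
from dataclasses import dataclass, field


@dataclass
class BusinessFunction:
    name: str
    cost_per_hour_usd: float                             # assumed loss per hour of downtime
    depends_on: list[str] = field(default_factory=list)  # names of functions it relies on


def affected_by(failed: str, functions: dict[str, BusinessFunction]) -> set[str]:
    """Return the failed function plus every function that depends on it,
    directly or through a chain of dependencies."""
    affected = {failed}
    changed = True
    while changed:  # repeat until a full pass adds nothing new
        changed = False
        for fn in functions.values():
            if fn.name not in affected and any(d in affected for d in fn.depends_on):
                affected.add(fn.name)
                changed = True
    return affected


def outage_cost(failed: str, hours: float, functions: dict[str, BusinessFunction]) -> float:
    """Hourly cost of all affected functions, multiplied by the outage length."""
    hourly = sum(functions[name].cost_per_hour_usd for name in affected_by(failed, functions))
    return hourly * hours


# Hypothetical functions and costs, for illustration only.
functions = {
    fn.name: fn
    for fn in [
        BusinessFunction("Employee laptops", 1_000),
        BusinessFunction("Order entry", 5_000, depends_on=["Employee laptops"]),
        BusinessFunction("Shipping", 2_000, depends_on=["Order entry"]),
        BusinessFunction("Payroll", 500),
    ]
}

# Rank functions by what a four-hour failure would cost once cascades are included.
priorities = sorted(functions, key=lambda n: outage_cost(n, 4, functions), reverse=True)
for name in priorities:
    print(f"{name}: ${outage_cost(name, 4, functions):,.0f}")
```

In this made-up example, the laptops have the lowest direct cost of the three linked functions but the highest total cost, because order entry and shipping both stop when they do. That is the kind of cascade a BIA is meant to surface.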
Just having a BIA isn’t enough. Update it regularly to reflect changes in your business environment, involve key stakeholders to gain comprehensive insights, and integrate it with your overall risk management and business continuity planning.
#2: Incident Response, Disaster Recovery, and Business Continuity Planning
It's not enough to have reactive measures. Proactive planning is the key to minimizing damage and ensuring your organization can continue operating smoothly. Focus on developing and regularly updating your incident response, disaster recovery, and business continuity plans. Each plays a different—but equally important—part in protecting your organization.
For all three, include these key elements:
- A designated point of contact (POC) and a leader charged with heading up the effort in a specific area.
- A schedule for updating the plan.
- A schedule for testing the plan—like tabletop exercises.
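One lightweight way to keep those three elements honest is to record them for each plan and check them on a schedule. The sketch below assumes a simple record per plan. The owners, dates, and intervals are hypothetical, and the right intervals for your organization may differ.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass
class Plan:
    name: str
    owner: str              # designated point of contact
    last_updated: date
    last_tested: date
    update_every_days: int  # assumed review interval
    test_every_days: int    # assumed exercise interval


def overdue_items(plan: Plan, today: date) -> list[str]:
    """List which scheduled activities ("update", "test") are past due."""
    items = []
    if today - plan.last_updated > timedelta(days=plan.update_every_days):
        items.append("update")
    if today - plan.last_tested > timedelta(days=plan.test_every_days):
        items.append("test")
    return items


# Hypothetical plans; names, owners, and dates are placeholders.
plans = [
    Plan("Incident response", "Security lead", date(2024, 1, 15), date(2024, 6, 1), 180, 90),
    Plan("Disaster recovery", "IT manager", date(2024, 7, 1), date(2023, 7, 1), 180, 365),
]

today = date(2024, 8, 1)  # fixed date so the example is repeatable
for plan in plans:
    overdue = overdue_items(plan, today)
    if overdue:
        print(f"{plan.name} ({plan.owner}): overdue for {', '.join(overdue)}")
```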
Incident Response
Outline clear steps for identifying, containing, and eliminating threats, along with procedures for communication and coordination among response teams:
- Identify Threats: Quickly detect and assess the nature of the disruption.
- Contain and Eliminate: Implement measures to contain the impact and eliminate the threat.
- Communicate and Coordinate: Ensure clear communication channels and coordinated actions among response teams.
Disaster Recovery
A disaster recovery plan provides a clear roadmap for response and recovery. It should include strategies for data backup, system restoration, and the use of redundant systems to maintain critical functions:
- Data Backup: Regularly back up critical data to secure locations.
- System Restoration: Develop procedures to restore systems quickly.
- Redundant Systems: Use backup systems to ensure continuous operation.
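Backup and restoration procedures are only useful if they meet the recovery point and recovery time objectives you set during the BIA. The following sketch shows that comparison in its simplest form. The timestamps and objectives are hypothetical, and the restore-time estimate is assumed to come from your own recovery tests.

```python
from datetime import datetime, timedelta


def meets_rpo(last_backup: datetime, now: datetime, rpo: timedelta) -> bool:
    """True if restoring from last_backup would lose no more data than the RPO allows."""
    return now - last_backup <= rpo


def meets_rto(estimated_restore_time: timedelta, rto: timedelta) -> bool:
    """True if the estimated restore time fits within the RTO."""
    return estimated_restore_time <= rto


# Hypothetical scenario: an outage at noon, with the last backup taken at 2 a.m.
now = datetime(2024, 7, 19, 12, 0)
last_backup = datetime(2024, 7, 19, 2, 0)

# Ten hours of data would be lost, so a 4-hour RPO is missed and a 24-hour RPO is met.
print(meets_rpo(last_backup, now, rpo=timedelta(hours=4)))
print(meets_rpo(last_backup, now, rpo=timedelta(hours=24)))

# A restore estimated at 6 hours fits within an 8-hour RTO.
print(meets_rto(timedelta(hours=6), rto=timedelta(hours=8)))
```

If a check like this fails for a critical function, the options are to back up more often, speed up restoration, or revisit whether the objective is realistic.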
Business Continuity
Align your business continuity plan with your strategic objectives to ensure operational stability and customer trust. The plan involves creating alternative work processes for critical functions, developing communication plans, and training employees on their roles during an outage:
- Alternative Work Processes: Develop, document, and practice procedures for carrying out critical processes without digital systems. Manual procedures were the norm before the digital age and can serve as a fallback in an event like the CrowdStrike outage.
- Communication Plans: All relevant personnel should know what needs to happen in the event of an incident, when it needs to happen, and how it needs to happen. They also need access to the plan wherever it may reside.
- Employee Training: Regularly train staff on their roles in continuity efforts.
#3: Third-Party Risk Assessments and Management
Vendor relationships are integral to your operations, but they also introduce vulnerabilities. Very few could have predicted that a faulty update from a security application would bring on the BSOD, yet the incident serves as a stark reminder of the interconnected nature of our digital ecosystems.
Third-party applications, like CrowdStrike, are deeply intertwined with our systems, amplifying the impact of any issues they face. That’s a risk that most are willing to take because of the value many vendors provide—but it doesn’t mean we can’t mitigate some of those risks.
Regularly evaluating and managing your third-party relationships is crucial for maintaining organizational resilience.
- Identify and Classify Vendors: Start by identifying all third-party vendors and classifying them based on their level of access to your systems and the criticality of their services. CrowdStrike had extremely high access to Windows systems, which allowed its failure to spread to the operating system.
- Conduct Risk Assessments: Perform thorough risk assessments to understand the potential impact each vendor could have on your organization. This includes evaluating their security practices, their compliance with regulations, and their own resilience plans.
- Implement Risk Management Strategies: Develop and implement strategies to mitigate identified risks. This can include establishing clear security requirements, creating contingency plans for vendor-related disruptions, and ensuring contracts include clauses for regular audits and compliance checks.
- Monitor and Review: Continuously monitor the performance and risk profile of your third-party vendors. Regularly review and update your risk management strategies to adapt to new threats and changes in the vendor landscape.
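The classification step above can be as simple as rating each vendor on system access and criticality, then letting the higher rating set a tier. The sketch below assumes that policy purely for illustration. The vendor names, ratings, tier labels, and review intervals are all hypothetical.

```python
from dataclasses import dataclass
from enum import IntEnum


class Level(IntEnum):
    LOW = 1
    MEDIUM = 2
    HIGH = 3


@dataclass
class Vendor:
    name: str
    system_access: Level  # how deeply the product reaches into your systems
    criticality: Level    # how much your operations depend on the service


# Assumed policy: the higher of the two ratings sets the tier,
# and each tier has its own review interval in days.
TIER_BY_LEVEL = {Level.HIGH: "Tier 1", Level.MEDIUM: "Tier 2", Level.LOW: "Tier 3"}
REVIEW_DAYS = {"Tier 1": 90, "Tier 2": 180, "Tier 3": 365}


def tier(vendor: Vendor) -> str:
    """Classify a vendor by whichever rating is higher."""
    return TIER_BY_LEVEL[max(vendor.system_access, vendor.criticality)]


# Hypothetical vendors, for illustration only.
vendors = [
    Vendor("Endpoint security agent", Level.HIGH, Level.HIGH),
    Vendor("Payroll service", Level.LOW, Level.HIGH),
    Vendor("Marketing analytics", Level.MEDIUM, Level.LOW),
    Vendor("Office supplies portal", Level.LOW, Level.LOW),
]

for vendor in vendors:
    t = tier(vendor)
    print(f"{vendor.name}: {t}, review every {REVIEW_DAYS[t]} days")
```

Using the higher of the two ratings means a vendor with deep system access lands in the top tier even if the service itself seems minor, which matches the lesson of this incident.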
The Path Forward
Outages are inevitable, whether due to software errors, cyberattacks, or natural disasters.
The key is not to strive for immunity but to build resilience.
With a business impact analysis to understand the potential effects of disruptions, comprehensive incident response plans to address and mitigate threats quickly, disaster recovery plans to restore systems and data, and business continuity plans to maintain essential functions, you can minimize downtime and better ensure critical functions remain operational.
By regularly conducting third-party risk assessments and implementing strong vendor management strategies, you can curtail many vulnerabilities introduced by third parties.
Organizational resilience should be holistic and driven from the top down. When it is aligned with your strategic business objectives, you can better prepare for the next disruption.
How HBS Can Help
Helping organizations build resilience is one of the things we are most passionate about at HBS. Our desire is that your business is best prepared for disruptions and that your strategies are based on risk, not fear.
From conducting business impact analysis to developing detailed incident response, disaster recovery, and business continuity plans, performing risk assessments, or even providing technology and information security leaders, HBS excels in pushing your business towards resiliency.
Contact us today to learn how we can help you fortify your organization.