The recent CrowdStrike outage, triggered by a faulty software update on 19 July 2024, is a stark reminder of the vulnerabilities inherent in our digital infrastructure. An estimated 8.5 million Windows machines around the world (Microsoft's figure) were crippled by the “blue screen of death,” causing significant disruptions across critical sectors like airlines, healthcare, and finance. This event offers valuable lessons for IT organisations, highlighting the importance of preparedness, redundancy, and robust business continuity plans.

The Cause of the Chaos

The culprit behind the global outage was a logic error in CrowdStrike’s Falcon sensor, triggered by a routine content configuration update intended to enhance the sensor’s threat detection capabilities. The result was widespread crashes on Windows systems. The event underscores the potential consequences of software bugs, even in security solutions designed to protect us.

A Post-Mortem: Lessons Learned from the CrowdStrike Outage

Impact Across Industries

The fallout from the outage was far-reaching. Airlines were forced to cancel thousands of flights, causing travel chaos. Hospitals grappled with disruptions to patient care, with some surgeries postponed and some emergency services experiencing outages. Financial institutions also suffered, with disruptions to payment systems and customer access. At the time of writing, losses to Australian businesses are projected to exceed $1 billion. This incident is a wake-up call, highlighting our dependence on stable IT infrastructure for critical services.

Learning from the Disruption

Several key takeaways emerge from this incident:

  • Disaster Recovery (DR) Plans are Crucial: A robust, regularly tested DR plan minimises downtime when systems fail. Here are some DR planning best practices:
    • Scenario-Based Testing: Develop simulations that mimic real-world scenarios, such as software updates gone wrong, cyberattacks, or natural disasters. This helps identify potential bottlenecks and areas for improvement.
    • Regular Updates and Reviews: DR plans are not static documents. Regularly review and update them to reflect changes in technology, threats, and organisational structure.
    • Integration with Business Continuity Plans: DR plans should integrate seamlessly with broader business continuity plans, ensuring a holistic approach to mitigating downtime and maintaining essential operations.
  • Data Backups are Essential: Regularly backing up critical data and storing it in secure, off-site locations is paramount. This ensures a quick recovery path in the event of disruptions. Here are some best practices for data backup:
    • The 3-2-1 Rule: Maintain at least three copies of your data on two different storage mediums, with one copy stored off-site. This ensures redundancy in case of localised disasters.
    • Regular Backup Testing: Don’t assume your backups are functional until you test them. Regularly conduct test restores to ensure data integrity and a smooth recovery process.
    • Data Classification and Prioritisation: Classify data based on criticality and prioritise backups accordingly. This ensures faster recovery of essential applications and minimises downtime.

  • Redundancy Mitigates Risk: Implementing redundant systems and applications can minimise downtime during outages. This allows businesses to continue essential operations by seamlessly switching to backup systems. Here are some ways to achieve redundancy:
    • Network Redundancy: Implement redundant network paths and failover mechanisms to ensure business continuity in case of network outages.
    • Server Mirroring: Utilise server mirroring or replication to maintain a synchronised copy of critical servers, allowing for a quick switchover during disruptions.
    • Cloud-Based Backup and Recovery: Leverage cloud-based solutions for data backup and disaster recovery. Cloud platforms offer scalability and geographically dispersed storage, enhancing redundancy and reducing downtime.
  • Communication is Key: Clear and timely communication with stakeholders during outages is crucial. IT teams should have established protocols for disseminating updates and maintaining transparency. Here are some communication best practices:
    • Define Communication Channels: Establish designated communication channels for internal and external stakeholders during outages. This ensures everyone receives timely and consistent updates.
    • Transparency is Paramount: Be transparent about the nature of the outage, the progress towards resolution, and the estimated recovery timeframe.
    • Proactive Communication: Don’t wait for inquiries. Proactively communicate updates to stakeholders, even if it’s to inform them that there are no new developments.
  • Scrutinise Software Updates: Organisations should implement a rigorous testing process for all software updates before deployment. This helps mitigate the risk of errors like the one that caused the CrowdStrike outage. Here are some software update best practices:
    • Multi-Stage Rollouts: Avoid deploying updates to all devices simultaneously. Instead, consider a phased rollout approach, starting with a smaller test group and gradually expanding.
    • Testing Environments: Utilise dedicated testing environments to vet software updates thoroughly before deploying them to production systems.
    • Change Management Processes: Integrate software updates into established change management processes, ensuring proper review, approval, and rollback procedures are in place.
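
The multi-stage rollout idea above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a description of how CrowdStrike or any particular vendor deploys updates: the names plan_stages and staged_rollout, the stage fractions, and the deploy and healthy callbacks are all hypothetical stand-ins for whatever deployment and monitoring tooling an organisation actually uses.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass
class RolloutResult:
    completed_stages: int      # stages that passed the health check
    halted: bool               # True if the rollout stopped early
    updated_hosts: List[str]   # every host that received the update


def plan_stages(hosts: Sequence[str], fractions: Sequence[float]) -> List[List[str]]:
    """Split hosts into stages using cumulative fractions, e.g. 1%, 10%, 100%."""
    stages, start = [], 0
    for fraction in fractions:
        # Each stage gets at least one new host, capped at the fleet size.
        end = min(len(hosts), max(start + 1, round(len(hosts) * fraction)))
        if end > start:
            stages.append(list(hosts[start:end]))
            start = end
    return stages


def staged_rollout(
    hosts: Sequence[str],
    fractions: Sequence[float],
    deploy: Callable[[str], None],    # hypothetical: pushes the update to one host
    healthy: Callable[[str], bool],   # hypothetical: post-update health check
    max_failure_rate: float = 0.0,
) -> RolloutResult:
    """Deploy stage by stage and halt as soon as a stage looks unhealthy."""
    updated: List[str] = []
    stages = plan_stages(hosts, fractions)
    for index, stage in enumerate(stages):
        for host in stage:
            deploy(host)
            updated.append(host)
        failures = sum(1 for host in stage if not healthy(host))
        if failures / len(stage) > max_failure_rate:
            # Stop here: the remaining hosts never receive the faulty update.
            return RolloutResult(index, True, updated)
    return RolloutResult(len(stages), False, updated)
```

With 100 hosts and fractions of 0.01, 0.10 and 1.0, a fault that first shows up in the second stage halts the rollout after 10 hosts rather than all 100. In practice the halt would also trigger the rollback procedure defined in the change management process.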

The Road to Resilience

By embracing these lessons, IT organisations can build resilience against future outages. Investing in robust disaster recovery plans, data backups, and redundant systems will minimise downtime and support business continuity. Additionally, fostering a culture of continuous improvement through regular testing and evaluation will strengthen incident response capabilities.
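
Two of the backup practices discussed earlier, the 3-2-1 rule and regular test restores, lend themselves to simple automated checks. The Python sketch below is illustrative only: BackupCopy, satisfies_3_2_1 and restore_is_intact are hypothetical names, the location labels are made up, and the list of copies is assumed to include the production copy. A real test restore would read files back from the backup system rather than compare in-memory bytes.

```python
import hashlib
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class BackupCopy:
    location: str   # e.g. "office-nas" (hypothetical label)
    medium: str     # e.g. "disk", "tape", "cloud"
    offsite: bool   # True if stored away from the primary site


def satisfies_3_2_1(copies: Iterable[BackupCopy]) -> bool:
    """At least three copies, on two different media, with one copy off-site."""
    copies = list(copies)
    media = {copy.medium for copy in copies}
    has_offsite = any(copy.offsite for copy in copies)
    return len(copies) >= 3 and len(media) >= 2 and has_offsite


def sha256_of(data: bytes) -> str:
    """Checksum to record when the backup is taken."""
    return hashlib.sha256(data).hexdigest()


def restore_is_intact(recorded_checksum: str, restored: bytes) -> bool:
    """Compare a test restore against the checksum recorded at backup time."""
    return sha256_of(restored) == recorded_checksum
```

A scheduled job could run both checks and raise an alert when either returns False, so a missing off-site copy or a corrupted backup is discovered before it is needed.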

Beyond the Basics: Building a Culture of Resilience

The CrowdStrike outage serves as a springboard for fostering a culture of IT resilience within organisations. Here are some additional considerations:

  • Invest in Employee Training: Regularly train employees on cybersecurity best practices and incident response procedures. Empower them to identify and report suspicious activity, potentially mitigating the impact of future outages.
  • Embrace Automation: Utilise automation tools to streamline tasks such as data backups, system patching, and failover procedures. This frees up IT staff to focus on more strategic initiatives while minimising human error in critical processes.
  • Security by Design: Integrate security considerations into all stages of the IT lifecycle, from system design and development to deployment and ongoing maintenance. This proactive approach helps prevent vulnerabilities and strengthens overall IT resilience.
  • Incident Response Exercises: Conduct regular incident response exercises to test and refine your team’s ability to respond to outages and security threats. This allows for early identification of weaknesses and continuous improvement of response protocols.
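
As one small example of the automation point above, the sketch below shows the decision logic behind an automated failover: switch to a standby system only after several consecutive failed health checks, so a single blip does not trigger a switchover. FailoverMonitor, its threshold of three, and the endpoint names are assumptions made for illustration. The health check itself and the actual traffic switch would come from an organisation's own monitoring and networking tools.

```python
from dataclasses import dataclass, field


@dataclass
class FailoverMonitor:
    primary: str
    standby: str
    failure_threshold: int = 3          # consecutive failures before failing over
    consecutive_failures: int = 0
    active: str = field(default="", init=False)

    def __post_init__(self) -> None:
        # Traffic starts on the primary system.
        self.active = self.primary

    def record_check(self, primary_healthy: bool) -> str:
        """Record one health check of the primary and return the active system."""
        if primary_healthy:
            self.consecutive_failures = 0
        else:
            self.consecutive_failures += 1
            if self.consecutive_failures >= self.failure_threshold:
                self.active = self.standby
        return self.active

    def fail_back(self) -> None:
        """Return to the primary; kept manual so a person confirms recovery first."""
        self.consecutive_failures = 0
        self.active = self.primary
```

Keeping fail-back as a deliberate manual step is a design choice in this sketch: it avoids flapping between systems while the primary is still unstable, at the cost of requiring someone to confirm recovery.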

The Evolving Threat Landscape

The digital landscape is constantly evolving, and so are the threats we face. By adopting a proactive and holistic approach to IT security, organisations can build resilience against a wide range of potential disruptions. The CrowdStrike outage serves as a stark reminder of the interconnectedness of our world and the critical role IT infrastructure plays in modern society. By learning from this event and prioritising IT security preparedness, organisations can ensure they are equipped to navigate future challenges and minimise disruptions to their operations and customers.

Backup and disaster recovery solution development, management, and maintenance 

Need assistance in developing or running your disaster recovery and resilience strategy? We can help. At Otto, a highly accredited MSP in Melbourne, we’re here to protect your organisation and your people against ransomware, natural disasters, IT scams, and other IT security risks your data faces on a daily basis. We’ll help you educate your team, ensure you have the best IT protection for your business, and be ready to act if your data or people are compromised. Whether you have remote teams, need a backup solution, or aren’t sure what provider is best for your business, we can help you out. Just think of us as your IT department.

Written by

Jordan Papadopoulos

Jordan is the Chief Commercial Officer at Otto. Jordan is here to help clients remove roadblocks and achieve the business goals they’ve set out. Jordan’s biggest focus is Customer Experience, Business Relationship Management, Risk Management and Strategy.