The CrowdStrike Fumble on 7/19

In case you’ve been living under a rock, cybersecurity provider CrowdStrike deployed an update to production in the wee hours of the morning this past Friday. This update introduced a bug that caused Windows machines running CrowdStrike’s Falcon sensor to crash, resulting in the BSOD, aka the Blue Screen of Death.

First of all, this is a reminder that humans are imperfect. Humans write code (or sometimes write programs that write code), and that code can be flawed too. NOBODY is perfect. Bugs in production happen all the time, which is why so much of the world’s software lists “minor bug fixes” in the release notes of every update.

Having said that, this widespread outage could have easily been avoided. Of course, better error handling in the code, robust regression testing, additional validation checks, etc. could all have helped prevent this by catching the bug before it was rolled out to prod. But there is a much simpler strategy that would have kept this from becoming the widespread issue it was, and it has nothing to do with the code in the update itself. The answer? Canary testing, or a phased rollout.

As a software professional, I am beyond shocked that a company as large as CrowdStrike did not have canary testing as part of its rollout and software development lifecycle processes. Canary testing is a software deployment strategy that releases a new version of a product to a small subset of users before rolling it out to a larger audience. Once a subset has run the new version for a while without major issues (no BSODs, for example), the version is rolled out to a slightly larger percentage of the user base. This approach lets developers monitor the new version’s performance, stability, and user feedback in a controlled environment. And most importantly, if any issues are detected, they can be addressed before a full-scale release.
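
To make that concrete, here is a rough sketch in Python of what a phased rollout loop could look like. It is purely illustrative: the names (phased_rollout, deploy, is_healthy), the stage percentages, and the 1% failure threshold are all my own assumptions, not anything CrowdStrike actually uses. A real pipeline would also let each stage "bake" for a while before checking health.

```python
import math
from dataclasses import dataclass
from typing import Callable, List, Optional, Sequence


@dataclass
class RolloutResult:
    completed: bool                      # True if every stage passed
    updated_hosts: List[str]             # hosts that received the update
    halted_at_percent: Optional[float]   # stage that tripped the halt, if any


def phased_rollout(
    hosts: Sequence[str],
    stages: Sequence[float],             # cumulative percentages, e.g. (1, 10, 50, 100)
    deploy: Callable[[str], None],       # hypothetical: pushes the update to one host
    is_healthy: Callable[[str], bool],   # hypothetical: post-update health probe
    max_failure_rate: float = 0.01,      # assumed threshold for halting
) -> RolloutResult:
    updated: List[str] = []
    for percent in stages:
        # Number of hosts that should have the update once this stage finishes.
        target = math.ceil(len(hosts) * percent / 100)
        batch = list(hosts[len(updated):target])
        if not batch:
            continue

        for host in batch:
            deploy(host)
        updated.extend(batch)

        # A real system would wait here before probing health.
        failures = sum(1 for host in batch if not is_healthy(host))
        if failures / len(batch) > max_failure_rate:
            # Stop immediately: the remaining hosts never see the bad update.
            return RolloutResult(False, updated, percent)

    return RolloutResult(True, updated, None)


if __name__ == "__main__":
    fleet = [f"host-{i}" for i in range(1000)]
    # Simulate an update that crashes every machine it touches.
    result = phased_rollout(fleet, (1, 10, 50, 100), lambda h: None, lambda h: False)
    print(result.completed, len(result.updated_hosts), result.halted_at_percent)
```

In this toy run, an update that breaks every host it lands on is stopped after the first 1% stage, so the damage is limited to that first small batch instead of the whole fleet.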

Apparently CrowdStrike does use canary testing and dogfooding (aka testing internally in prod before rolling out to general production users). However, it does not do this for Rapid Response Content deployments, which is the type of update that caused this crash. Anyone else shocked by this? Am I missing something here?