10 Important Lessons the CrowdStrike Incident Taught Us
On the 19th of July 2024, a faulty update to CrowdStrike's endpoint protection software caused mayhem around the world. Windows computers that received the update started showing the infamous blue screen of death. As a result, many flights were canceled, bank transactions couldn't be processed, and many other services were disrupted.
Arguably, this has been the largest IT outage of all time. At the time of writing, it still hasn't been fully resolved, and many systems are still not operational.
Once the problem has been resolved and the affected devices have been made operational again, the most important thing to do in this kind of situation is to learn from it. Today, I will share ten lessons I've learned. Some of them may be controversial, but we must talk about them.
1. Even if you do things right, your reputation can be damaged by the actions of a third party
When the incident occurred, news outlets across the world immediately started talking about how Microsoft was responsible for the outage. However, Microsoft had very little to do with it. The outage was caused by a software update from CrowdStrike, an antimalware company that is huge in its industry but little known to the general public.
The only reason Microsoft was named as the culprit was that it was Windows devices that were affected. Also, Microsoft is very well known, while CrowdStrike isn't (unless you work in the cybersecurity niche).
Even after it became clear who the real culprit was, news outlets and people on social media continued criticizing Microsoft. But was any of this criticism well-founded? Let’s see.
Many people said that Microsoft should have never allowed third-party software to run at the kernel level, i.e. deep inside its operating system. However, while it’s true that no software should run at the kernel level without a good reason, this particular software had a good reason to do it. Anti-malware software has to run at the kernel level because this is what malware tries to do. So this criticism is unfounded.
Other people said that Microsoft shouldn't have chosen CrowdStrike as a vendor. Well, it wasn't Microsoft that chose it as a vendor. It was the corporate Microsoft customers who did. Besides, the reason CrowdStrike was allowed to publish installers for Windows was that the company had a very good track record before the incident. So there wasn't any reason to believe that CrowdStrike was in any way untrustworthy.
Couldn't Microsoft have done more to prevent this kind of incident from happening? Maybe. But Microsoft had already done quite a lot in terms of due diligence. So this is a prime example of how, even if you do things right, an action by a third party can affect your reputation.
2. Being big and powerful is both an asset and a liability
Having a large market share is nice. It ensures an unending stream of revenue. However, as this incident demonstrated, it doesn’t come without its tradeoffs.
Both Microsoft and CrowdStrike were forced by the incident to pull people off various projects to help the affected customers. Being big and powerful didn't spare them from the fallout; it put them right at the center of it.
The fallout was huge, as over 8.5 million devices were affected. To add to the headache, every one of these devices needed to be fixed manually. It wasn't necessarily either of these companies doing the fixing, but both still needed to provide the tech support for users to do it themselves.
As the famous movie quote says, with great power comes great responsibility.
3. Even if you are big and powerful, a small mistake can ruin you
It became apparent that the cause of the incident was a small bug that perhaps spanned no more than a few lines of code. Early reports pointed to a so-called null pointer dereference (often known as a null pointer exception). In a nutshell, this is where the code reads from an invalid memory address. Since there's nothing valid at that address, the read crashes the software, and because the code runs at the kernel level, it takes the whole operating system down with it.
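CrowdStrike hasn't published the offending code, so here is a minimal illustrative sketch in Rust of the kind of invalid memory access being described (the variable name is made up for the example):

```rust
use std::ptr;

fn main() {
    // A raw pointer to address 0 -- an invalid memory location.
    let config: *const u8 = ptr::null();

    // Dereferencing it is undefined behaviour. In practice the program
    // crashes with a memory access violation; when the same thing happens
    // inside a kernel-level driver, the whole machine goes down rather
    // than a single process.
    unsafe {
        println!("first byte: {}", *config);
    }
}
```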
In terms of the lines of code, it was probably just a small mistake. Maybe a missing line. Maybe a typo. But in terms of the consequences, it was huge, especially for the company involved.
At the time of writing, CrowdStrike's share price has dropped by almost 20% and hasn't recovered. For a company that was worth around $80 billion, this represents billions being wiped off its value.
But this is not where the consequences for CrowdStrike stop. Once the dust settles, there will be legal actions by the affected organizations, of which there are many. CrowdStrike may or may not survive this.
This proves that, no matter how big you are, something seemingly small can be the source of your downfall.
4. Proper tests are a necessity and not a luxury
There is plenty of speculation as to what exactly CrowdStrike did wrong to cause this incident, but one thing is absolutely certain: the software wasn't sufficiently tested before it was released.
This is a common problem in IT. Most of the time, it doesn’t cause incidents of this magnitude, but it’s a problem nonetheless. Way too many engineering managers and executives believe that automation testing is an unnecessary luxury that only slows the development process down. But every good software engineer knows that automation testing with proper test coverage is a necessity and not a luxury.
Yes, engineers argue about the specific approaches. For example, some like test-driven development while others hate it. However, any good engineer who has been in the industry long enough will agree that the software needs to be tested properly before it comes out.
Software of sufficient complexity cannot be tested properly unless an automation testing pipeline is in place. Yes, individual new features and recent bug fixes can be tested manually. But only existing automated tests can guarantee that the new changes haven't broken any existing functionality.
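As a rough sketch of what such a regression test looks like, here is a hypothetical parser guarded by automated tests; the function name and file format are invented for the example and are not taken from CrowdStrike's codebase:

```rust
/// Hypothetical parser for a content/configuration file.
/// It rejects malformed input instead of crashing on it.
fn parse_channel_file(bytes: &[u8]) -> Option<Vec<u32>> {
    if bytes.len() % 4 != 0 {
        return None; // truncated or corrupted input
    }
    Some(
        bytes
            .chunks_exact(4)
            .map(|c| u32::from_le_bytes([c[0], c[1], c[2], c[3]]))
            .collect(),
    )
}

#[cfg(test)]
mod tests {
    use super::*;

    // These tests run on every build in the CI pipeline, so a change that
    // breaks the handling of bad input fails the build long before the
    // release reaches any customer.
    #[test]
    fn well_formed_input_is_parsed() {
        let bytes = 7u32.to_le_bytes();
        assert_eq!(parse_channel_file(&bytes), Some(vec![7]));
    }

    #[test]
    fn malformed_input_is_rejected_not_crashed() {
        assert_eq!(parse_channel_file(&[0x00; 3]), None);
    }
}
```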
If such a serious bug in CrowdStrike's software could slip through, we can only conclude that the software wasn't tested thoroughly enough and certainly didn't have sufficient automated test coverage.
The next lesson will probably be controversial and will almost certainly ruffle some feathers. But before I tell you what it is, I wanted to shout out to my affiliate partner, an online learning platform called Educative.
The platform provides a wide variety of text-based programming courses covering many different programming languages and technologies. But the best thing about this platform is that it’s fully interactive. As a learner, you don’t have to set up your development machine before you can practice. You can start doing it directly in your browser.
Another great thing about this platform is that it's an absolute bargain. It operates on a monthly subscription model that is probably cheaper than a Netflix subscription. To find out exactly how cheap it is, click on the logo above.
Now, back to the lessons learned from the global IT incident.
5. It’s time to ditch unsafe programming languages like C and C++
This point is controversial because many programmers out there specialize in C and/or C++. Also, many systems that we use every day are written in either of these languages.
However, a programming language is just a tool. A software engineer must be an engineer first and foremost. No engineer should get overly attached to the tools he or she uses.
The problem with C and C++ is that these languages operate at a low level of abstraction and manage the computer's memory directly. Pointers, a feature that allows code to refer to specific addresses in memory, are standard in both languages. C and C++ developers often use pointers even when it's not necessary, and it's precisely this feature that caused the bug in the CrowdStrike software.
It's without question that this software needed a language with a low level of abstraction, as it runs at the kernel level. However, other languages exist that can do anything C or C++ can do while being much safer.
Rust is a prime example of such a language. It still has raw pointers, but dereferencing them is disabled by default: any block of code that does so needs to be explicitly marked as unsafe.
While there is no guarantee that using Rust would have prevented this bug, it would certainly have made it harder for such a bug to appear. Nothing stops developers from writing unsafe code with raw pointers, but an explicit unsafe block attracts extra scrutiny from code reviewers. It also discourages pointers from being used in places where they can and should be avoided.
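Here is a minimal sketch of the contrast. The safe version below cannot produce an invalid memory access, while the raw-pointer version has to be wrapped in an unsafe block that immediately stands out during code review:

```rust
fn main() {
    let values = [10u32, 20, 30];

    // Safe Rust: references are checked by the compiler and can never be
    // null, and out-of-bounds access is caught instead of reading garbage.
    if let Some(first) = values.first() {
        println!("first value: {}", first);
    }

    // Raw pointers are still available, but every dereference has to be
    // explicitly marked unsafe -- an immediate signal to reviewers.
    let ptr: *const u32 = values.as_ptr();
    unsafe {
        println!("via raw pointer: {}", *ptr);
    }
}
```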
Even the NSA recommended that developers stop using C and C++ for the reasons described above. It appears that the NSA was right after all.
6. Soak testing and having a staging environment can be much more important than continuous delivery
This is another point that may ruffle some feathers.
There is a popular idea in the software development community that every software company needs a fully automated continuous delivery pipeline capable of pushing out multiple production releases every day. While having a production-like staging (or pre-prod) environment isn't necessarily incompatible with this idea, using that environment for soak testing is anathema to the continuous delivery philosophy.
For those who aren't familiar with soak testing, it's when a new version of the software is left running in a production-like environment for some time, typically days, to see whether it exhibits any unexpected behavior or side effects. It obviously contradicts the continuous delivery philosophy: if your soak-testing period lasts more than a day, you can't deploy multiple times per day.
However, while I have nothing against continuous delivery, it certainly isn’t suitable everywhere. In this particular case, having continuous delivery would not have added any value, as you wouldn’t want this type of software to be released every day, let alone multiple times a day. However, soak-testing in a production-like environment (i.e. on an isolated machine running Windows) would have almost certainly led to the discovery of this defect before it reached production.
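For illustration, here is a hedged sketch of what a very basic soak-test loop could look like. The service name, the Linux-style health check, and the three-day window are all assumptions made for the example; this is not CrowdStrike's actual process, and their target would be a Windows machine rather than a Linux one:

```rust
use std::process::Command;
use std::thread::sleep;
use std::time::{Duration, Instant};

fn main() {
    // Keep the candidate build running on an isolated staging machine and
    // verify every few minutes that it is still healthy.
    let soak_period = Duration::from_secs(3 * 24 * 60 * 60); // three days
    let check_interval = Duration::from_secs(300); // five minutes
    let started = Instant::now();

    while started.elapsed() < soak_period {
        // Placeholder health check: ask the OS whether the (hypothetical)
        // agent service is still running.
        let healthy = Command::new("systemctl")
            .args(["is-active", "--quiet", "my-endpoint-agent"])
            .status()
            .map(|status| status.success())
            .unwrap_or(false);

        if !healthy {
            eprintln!("Soak test failed after {:?}; blocking the release.", started.elapsed());
            std::process::exit(1);
        }
        sleep(check_interval);
    }
    println!("Soak test passed: the candidate ran cleanly for the full period.");
}
```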
7. No operating system is safe
When the incident happened and the pictures of screens showing the infamous blue screen of death started popping up all over the news feeds and social media, many people started talking about how this wouldn’t have happened on Linux or Mac.
However, the reality is that it could well have happened on Linux or Mac. There is nothing fundamentally different about these operating systems that would have prevented it from happening.
The blue screen of death is not a Windows-specific phenomenon. A screen of death caused by a kernel-level error that makes your computer unresponsive happens on Linux and Mac too; it just looks different on each system, typically a black screen with error text on Linux and a kernel panic screen on Mac.
Also, a very similar incident involving the same cybersecurity company has actually happened on Linux in the past. Hardly anyone noticed because there are nowhere near as many user-facing Linux computers as there are Windows ones. Not to mention that many previous large-scale IT outages have affected Linux servers.
Mac has a better track record, as there aren't any well-known large-scale IT outages involving Apple devices. However, this has little to do with the reliability of the operating system. A simple Google search for "Mac bricked after update" will give you an idea of how many problems happen with Mac devices during routine software updates.
The only reason there haven't been any major IT outages caused by problems with Mac devices is that these devices are hardly used in critical business infrastructure. Other than being occasionally seen at the reception desk of a high-end hotel, Macs are primarily consumer devices.
8. System backups should be standardized
Although the root cause was a single faulty file, the problem this incident created was difficult to fix. It required booting each affected Windows machine into safe mode, deleting the offending file, and performing a few more manual steps.
All of this could have been prevented if backing up the system were a standard part of every major software update. It doesn't even have to be every update of every piece of software, but it should certainly be the default behavior before updating any software that has access to low-level system components.
The backup should be kept for some time so the system can easily be restored to the last known good state. The restore could even be triggered automatically as soon as a major system-level problem is detected, requiring no manual intervention at all.
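As a rough sketch of the idea (the paths are hypothetical, and this is not an existing Windows or CrowdStrike mechanism): snapshot the files an update is about to touch, and restore them automatically if the system fails to come back up cleanly.

```rust
use std::fs;
use std::io;
use std::path::Path;

// Hypothetical paths -- stand-ins for whatever files a low-level update touches.
const DRIVER_DIR: &str = "/opt/agent/drivers";
const BACKUP_DIR: &str = "/opt/agent/drivers.bak";

/// Copy every file in `src` into `dst` (assumes a flat directory, for simplicity).
fn copy_dir(src: &Path, dst: &Path) -> io::Result<()> {
    fs::create_dir_all(dst)?;
    for entry in fs::read_dir(src)? {
        let entry = entry?;
        fs::copy(entry.path(), dst.join(entry.file_name()))?;
    }
    Ok(())
}

fn main() -> io::Result<()> {
    let live = Path::new(DRIVER_DIR);
    let backup = Path::new(BACKUP_DIR);

    // 1. Snapshot the last known good state before applying the update.
    copy_dir(live, backup)?;

    // 2. Apply the update here (omitted).

    // 3. If a boot-time watchdog later detects repeated crashes, it restores
    //    the snapshot instead of waiting for a human to boot into safe mode.
    let update_is_healthy = true; // placeholder for a real health check
    if !update_is_healthy {
        copy_dir(backup, live)?;
    }
    Ok(())
}
```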
This solution is not new. It’s used by many types of software. So why not make it standard for operating systems?
9. Don’t deploy to every device all at once
One of the major reasons a faulty software update caused a global outage is that the update was pushed to all devices at the same time. This is sometimes called a YOLO deployment. Had a different deployment strategy been chosen, the consequences of a faulty release would have been much smaller, even if the defect had somehow made its way into the release.
One such strategy is canary deployment: the software is first deployed to only a very small subset of devices before being rolled out to the rest. Had this strategy been used, that small subset would still have been affected, but the faulty software would never have reached any other devices until a fix was released.
CrowdStrike would still have needed to get involved in fixing the affected devices, but there would have been far fewer of them to fix, and the global outage would almost certainly have been avoided.
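Here is a minimal sketch of how a canary gate might decide which devices receive a new update first; the 1% ring size and the device IDs are invented for the example:

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Decide whether a device belongs to the canary ring. Hashing the device ID
/// gives a stable, roughly uniform assignment, so the same small slice of the
/// fleet receives every early rollout.
fn in_canary_ring(device_id: &str, canary_percent: u64) -> bool {
    let mut hasher = DefaultHasher::new();
    device_id.hash(&mut hasher);
    hasher.finish() % 100 < canary_percent
}

fn main() {
    let fleet = ["host-0001", "host-0002", "host-0003", "host-0004"];
    for device in fleet {
        if in_canary_ring(device, 1) {
            println!("{}: gets the new update now", device);
        } else {
            println!("{}: waits until the canary ring reports healthy", device);
        }
    }
}
```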
Another good strategy is not to push the release out automatically but merely to tell users that it's available to be installed at a time of their choosing. Some devices would still be affected, but there is a good chance that, by the time most users got around to installing the update, a fixed version would already have replaced the faulty one.
10. Don’t put your eggs into the basket that everyone else shares
The final point is that this outage wouldn’t have happened if fewer critical infrastructure systems relied on the same cybersecurity software.
In a normal situation, the fact that a particular product has a large market share tells us that this product is reliable and trusted. But when it comes to software products, it’s a double-edged sword.
Choosing the same software product as everyone else makes you vulnerable to any problems that product has. It can be a bug, as in this case, or it can be a cyberattack, as has happened on many occasions in the past. If a group of hackers manages to break the security of a specific piece of software, they gain access to every device that runs it.
So, if you can, try not to put your eggs into the same basket that everyone else uses. Diversify your software assets, especially when it comes to the products that have access to your system.
Wrapping Up
While all of us make mistakes, and I wouldn't blame the individual developers who were involved in building CrowdStrike's software, these types of incidents are completely preventable.
Certain rules can prevent faulty releases from going out. Many experienced software engineers know exactly what these rules are. We, as engineers, need to make sure that these rules become the standard operational procedures at the companies we work for. Basic principles, like having automation testing and minimizing human error, should not be treated by managers and executives as unnecessary “nice-to-haves”.
P.S. If you want me to help you improve your software development skills, you can check out my courses and my books. You can also book me for one-on-one mentorship.