October 30, 2024

4 important lessons learned after the recent Microsoft/CrowdStrike outage

In life we learn lessons everywhere, but the toughest ones arrive unexpectedly. It’s even more unexpected when the lesson comes from a reputable company that handles data for important organizations around the globe. This incident, though, was not about data leakage but about line-of-business applications being down.

To add even more to the secret sauce of this outage, there is CrowdStrike, a Microsoft partner company supplying security solutions, which acted like the small brick that cracked the whole wall. CrowdStrike hooks deep into Microsoft products: its Falcon sensor runs inside the Windows kernel, which is why one bad update could take whole machines down.

While it’s not entirely clear what caused this major outage, and the official statements will dilute the truth to some extent, I do believe the culprit was a small glitch that was not caught during testing. We will have to accept that these events will keep occurring, regardless of developer or tester skill level, regardless of AI adoption and regardless of the security policies in place.

Lesson 1: we have become too tech reliant

Nowadays it seems we cannot go anywhere without a piece of tech equipment with us. It so happens that gadgets break exactly when we need them the most, and it’s usually the software that is problematic. In this regard, there is one ugly truth we must face: there is no perfect software! There are so many edge cases and so many possibilities that you cannot create software that just works; there will always be a situation where it has issues. It’s just the nature of it.

I ran into such a situation just this morning, when I wanted to create an account with one of the supermarkets in the area to redeem points on purchases. I was waiting in line at the checkout desk, so I installed the app and created the account. At some point you had to choose your favorite store, but if you used the search feature, the whole app froze. I just couldn’t complete the process in time.

Lesson 2: don’t put your eggs in one basket

This is one of the best rules in investing, and it applies to business scenarios as well. In this context, it’s about not relying on a single provider to run your apps. Microsoft is a good, reliable partner when it comes to cloud hosting and infrastructure, but if you had a backup provider, you would have been less affected by this CrowdStrike outage.

You can also see this situation at the operating system level: if all your enterprise machines run solely on Windows, then you certainly have all your eggs in one basket. Diversify with Mac or Linux machines as well, and your business will be able to keep running continuously.
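To make this concrete, here is a minimal sketch in Python of what relying on more than one provider can look like. Everything in it is an assumption for illustration: the provider URLs and the /health endpoint are made up, not real services.

```python
import urllib.request

# Hypothetical endpoints: these URLs and the /health path are placeholders.
PROVIDERS = [
    "https://api.primary-cloud.example.com",
    "https://api.backup-cloud.example.com",
]

def first_healthy_provider(providers, timeout=2):
    """Return the base URL of the first provider that answers its health check."""
    for base_url in providers:
        try:
            with urllib.request.urlopen(f"{base_url}/health", timeout=timeout) as resp:
                if resp.status == 200:
                    return base_url
        except OSError:
            # Connection refused, DNS failure, timeout... try the next provider.
            continue
    raise RuntimeError("No healthy provider available")

# All subsequent requests go through whichever provider is currently up.
active = first_healthy_provider(PROVIDERS)
```

The point is not the code itself but the shape of it: when the primary provider is down, the business keeps running on the backup instead of waiting for someone else’s incident to be resolved.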

Lesson 3: have proper test scenarios for releases

Not all companies manage to test their new releases properly. There are cases where you perform full regression testing and bugs are still found. Deadlines and a lack of resources or time can lead to untested software reaching the production environment.
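Even a thin automated smoke-test layer catches a surprising share of these slips before release day. Here is a minimal sketch with pytest; checkout_total is a hypothetical function invented purely for this example, not code from any real app.

```python
# test_smoke.py -- run with: pytest test_smoke.py
# checkout_total is a made-up example function, used only to illustrate the idea.

def checkout_total(prices, discount=0.0):
    """Sum the cart and apply a percentage discount."""
    subtotal = sum(prices)
    return round(subtotal * (1 - discount), 2)

def test_happy_path():
    assert checkout_total([10.0, 5.5]) == 15.5

def test_edge_cases():
    # Empty carts and full discounts are exactly the edge cases
    # that slip through when testing is rushed.
    assert checkout_total([]) == 0.0
    assert checkout_total([20.0], discount=1.0) == 0.0
```

A suite like this takes seconds to run in a pipeline, which removes the usual excuse that there was no time to test.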

Of course this has serious consequences: degraded app performance, a poor user experience, stress and pressure. Companies lose revenue, their profits dwindle, and recovery after such an incident can be slow.

I’ve seen situations in my career where code changes were pushed because of a deadline or for the sake of making a good impression about the team. My advice is to always take your time and test your “toys” before publishing. We all deserve good sleep and less stress πŸ˜‰.

Lesson 4: security policies and access segregation

In my recent work assignment I realized that every aspect of the app lifecycle and deployment is strictly confined and monitored. It seemed really cumbersome in the beginning, but in time I realized its benefits:

  • User or client data is protected
  • A poor piece of code will not reach a higher environment
  • Proper audits can take place
  • Relevant support teams can react when incidents happen (and they do, albeit small ones if the process is respected)

Policies and access segregation should deny developers the right to deploy to higher environments (QA, PREPROD, PROD). Dev data should not exist in Prod, and vice versa, with respect to security policies. A release to a PROD environment should always be preceded by a QA deploy and thorough testing.
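As a sketch of what such a gate could look like, here is a small Python check enforcing both rules: access segregation and promotion order. The role name and the release record are illustrative assumptions, not any specific tool’s API.

```python
# Promotion order: a build may only move one step up this chain.
PROMOTION_ORDER = ["DEV", "QA", "PREPROD", "PROD"]

def can_deploy(release, target_env, user_roles):
    """Gate a deployment: enforce access segregation and promotion order."""
    # Developers may only deploy to DEV; higher environments need a release role.
    # "release-manager" is a hypothetical role name for illustration.
    if target_env != "DEV" and "release-manager" not in user_roles:
        return False, "access denied: developers cannot deploy above DEV"

    # A build must have passed through every lower environment first.
    target_index = PROMOTION_ORDER.index(target_env)
    for env in PROMOTION_ORDER[:target_index]:
        if env not in release.get("deployed_to", []):
            return False, f"blocked: release has not been verified in {env}"
    return True, "ok"

release = {"version": "1.4.2", "deployed_to": ["DEV", "QA"]}
print(can_deploy(release, "PROD", {"release-manager"}))     # blocked: no PREPROD run yet
print(can_deploy(release, "PREPROD", {"release-manager"}))  # ok
```

In practice this logic usually lives in the CI/CD platform’s permissions and environment rules rather than in hand-written code, but the checks it performs are the same.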

Teams should have a deploy plan which contains data migrations, infrastructure preparations and so on, and of course a rollback plan in case things go downhill.
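The shape of such a plan can be pictured in code; this is a hedged sketch where every step is a hypothetical placeholder for whatever your pipeline actually runs.

```python
def deploy_with_rollback(steps):
    """Run deploy steps in order; undo completed steps in reverse on failure.

    Each step is a (name, apply, rollback) triple of callables.
    """
    completed = []
    try:
        for name, apply, rollback in steps:
            print(f"applying: {name}")
            apply()
            completed.append((name, rollback))
    except Exception as exc:
        print(f"deploy failed ({exc}); rolling back")
        for name, rollback in reversed(completed):
            print(f"rolling back: {name}")
            rollback()
        raise

# Hypothetical steps; in a real plan these would migrate data,
# prepare infrastructure and switch traffic over.
steps = [
    ("migrate database", lambda: None, lambda: None),
    ("provision cache", lambda: None, lambda: None),
    ("switch traffic to new version", lambda: None, lambda: None),
]
deploy_with_rollback(steps)
```

Writing the rollback next to each step forces the team to think about the way back before things go downhill, not during the incident.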

Maybe developers hate documentation, but it’s good to have it so that everybody is aligned with the process. Life is a process itself, exactly like developing apps.

Conclusion

This incident does not surprise me at all. Having worked with both big and small companies, on projects of different sizes and technology mixes, I have seen how chaotic the development process still is, even at big companies. People tend to believe that tech giants are immune to anything, but reality is harsh. We live intensely, often under pressure, and this rat race will keep evolving into a quagmire of events.

I’m waiting for your opinions on this; it really matters to me. Cheers! πŸ™‚

afivan

Enthusiast adventurer, software developer with a high sense of creativity, discipline and achievement. I like to travel, I like music and outdoor sports. Because I have a broken ligament, I prefer safer activities like running or biking. In a couple of years, my ambition is to become a good technical lead with entrepreneurial mindset. From a personal point of view, I’d like to establish my own family, so I’ll have lots of things to do, there’s never time to get bored πŸ˜‚
