Microsoft 365 Product Outage and Impact
There are two types of outages - the ones that can be avoided and the ones that can be avoided.
Microsoft, one of the largest and technologically advanced companies on the planet, had an outage yesterday. Due to the nature of the beast, a lot of people were impacted by it. I wanted to write a brief opinion on what happened and how to avoid it.
Microsoft is on record saying that the outage was not one of malicious origins, and therefore it could be one of two things:
- Physical hardware breaking
- Newly introduced bug
Figure 1. Have you seen this message?
This sort of thing is unavoidable. No hardware on the planet that will ever be fool-proof, but this is a known commodity at this point. The reason I say that is because there should have been plenty of mitigating factors in place to ensure that in the event of a hardware malfunction, appropriate measures would be in place to failover. The outage may have occurred, but there is no way to be sure. However, if this was the case, they should have tested in pre-production the highwater mark of active users on their system and ensured that they could handle this hardware failure scenario.
Newly Introduced Bug:
This sort of thing is also unavoidable to a certain degree. There is never going to be a software development company that releases a completely flawless piece of software, EVEN if you are Microsoft. During the pre-production stages of a release, you go through a series of tests during software development. First, testing each function of code is tested within the code itself. Afterward, you will do a functional test. During the functional test, each function of the software application is tested (what is the software supposed to do? Well, it’s supposed to log you in. Can it log you in?). User Interface testing generally follows and then load and penetration testing.
It appears that most likely, the issue was related to load testing. I am only speculating here, but this issue seems to have been exposed at a specific volume based upon Microsoft’s feedback, related to the outage.
They stated that this appears to have been an issue in previous releases when put under the same set of circumstances.
So why am I pointing all of these things out? My responsibilities are in educating IT professionals on the value of testing, and I feel strongly this is a scenario where the appropriate testing could have avoided potential downtime.
The bonus points here are as it relates to the size of the outage. Microsoft has made a concerted effort, and rightly so, in centralizing the consumption of their products. The issue this creates is the magnitude. Bottom line: test, test, test, and then test some more.