Topic: Outages
Microsoft says it has a fix to solve latest online services glitch – Axios (Mar 21, 2017)
The recent downtime for Microsoft’s various cloud services hasn’t received nearly the attention Amazon’s recent outage did, in part because it’s more of a brownout than the total blackout AWS experienced, and arguably also because fewer third-party services rely on Azure and related services. But Microsoft has had a couple of recent issues, and as of right now they’re happening again. Any large-scale infrastructure will always have issues here and there, but the fact that Microsoft’s have recently been lasting for hours and repeating is a little worrisome, and it’ll be good to see the explanation when Microsoft finally shares it.
via Axios
Amazon Explains its Massive S3 Outage (Mar 2, 2017)
Amazon’s S3 service went down in part of the US on Tuesday, something I commented on at the time. We now have an official explanation: an employee attempting to debug an issue with the AWS billing system accidentally took down more servers than intended, which in turn had a knock-on effect on several other services that manage other aspects of the S3 system (including the dashboard that reports whether the service is performing as expected). Restarting several of the servers took far longer than anyone had expected, which meant Amazon’s contingency planning turned out not to be adequate after all. It sounds like Amazon has now put in place some protections to prevent similar things from happening in future, but once again it’s a reminder of how vulnerable big chunks of the Internet are to an AWS outage, something we discussed in depth on this week’s Beyond Devices Podcast, recorded earlier today shortly after this announcement was made.
via Amazon
Amazon cloud issues send Web publishers scrambling – Axios (Feb 28, 2017)
I might update this or post a follow-up later, since the outage is still underway and there isn’t yet an official explanation. But it’s already clear that this outage is having very widespread impacts, not just on a couple of big tech companies but on a variety of news sites and other businesses too. This is a great illustration of the enormous power a single player can have when it holds a dominant market share, and conversely of the danger customers put themselves in by failing to secure adequate redundancy. One of the changes between Snap’s original S-1 and its S-1/A filing was the inclusion of a deal with Amazon (ironically) to provide redundancy for its Google Cloud services, and I think it’s very unlikely the timing was a coincidence: I suspect the investors Snap talked to first were wary of its massive dependence on a single cloud provider. But that kind of redundancy can cost an awful lot, especially at scale. Snap’s contractual commitment to Amazon five years out is almost the same as its commitment to Google in the same year, which is not to say it will actually end up spending the same, but it’s indicative of the problem here. Of course, the Snapchat app hasn’t gone down today while many other services and sites have. Had Snap single-sourced from Amazon, perhaps it would have, which would have been disastrous the week of its IPO.
via Axios