Last Thursday morning around 5 a.m. Eastern time, Amazon suffered a major data center outage. These sorts of outages happen now and then, but they seldom make news. This time, though, mainstream outlets took notice, with CNN wondering "why Amazon's cloud Titanic went down." The server failures made headlines for two reasons. First, the Amazon failure took down some high-profile clients, including Foursquare, Quora, and Reddit. Second, it called attention to the steady movement by companies large and small to host their applications in the cloud.
For Amazon, whose Web Services division has built this vital business (and a whole industry) in just five years, the outage has been an embarrassment. (Disclosure: Milo and Vijay worked at Amazon for 10 and seven years respectively, and Milo worked on the Amazon Web Services team for a brief time.) Commentators have wondered whether this system failure would push the pause button on cloud computing as a solution for everything, and whether it would be a boon to Amazon's networking competitors. In the long term, though, it's quite possible that Amazon will benefit from last week's events.
Amazon has hubs for cloud computing services all over the country. Clients can choose to use just one hub—known as an "availability zone," it's essentially a conceptual data center that exists only in the cloud—or distribute across several nearby hubs, decreasing risk. As Reddit explained in a comprehensive post-mortem on the company's blog, the "outage affected a specific product … on a specific [Amazon Web Services] availability zone. Unfortunately, a disproportionate amount of our services rely on this exact product, in this exact availability zone." The result: a site crash.
What makes the Amazon outage troubling to many customers is not that one hub went down, but that the failure cascaded. Since details are lacking at the moment, it's hard to say why this happened. The most probable scenario is that once Amazon's customers realized that Elastic Block Store, Amazon's storage service, was down in a single availability zone, they sent a huge number of simultaneous requests to the remaining availability zones. This could have degraded performance on Amazon's remaining servers, causing yet another wave of slowdowns.
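The arithmetic behind that scenario is easy to sketch. The zone count and request volume below are made-up numbers for illustration, not figures from Amazon, and the function name is our own; the sketch also assumes displaced traffic spreads evenly over the surviving zones, which real traffic rarely does.

```python
def load_after_failure(total_requests, zones, failed):
    """Per-zone load if traffic from failed zones spreads evenly over survivors.

    All inputs are hypothetical; this is not a model of Amazon's actual capacity.
    """
    if failed >= zones:
        raise ValueError("at least one zone must survive")
    return total_requests / (zones - failed)


# Illustrative numbers only: 12,000 requests per second across four zones.
before = load_after_failure(12000, zones=4, failed=0)
after = load_after_failure(12000, zones=4, failed=1)
increase = after / before - 1

print(f"per-zone load before: {before:.0f}, after: {after:.0f} ({increase:.0%} more)")
```

Under these assumptions, losing one zone in four pushes about a third more traffic onto each survivor, and that is before any customers start retrying failed requests.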
Customers who thought they'd protected themselves went down as the problem spiraled out of control. Among the affected sites was our most recent project, the Washington Post Co.'s personalized news service Trove, which had launched one day earlier. Trove.com was set up with data replication and application servers in two different availability zones. Once Amazon engineers stabilized one of those zones, we were able to send all of our traffic over to the healthy availability zone. About five hours after the outage started, Trove was back up and running.
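The failover idea described above can be sketched in a few lines. This is a minimal illustration, not Trove's actual setup: the zone labels, the `Zone` type, and the `route_weights` helper are all hypothetical, and the health flag stands in for whatever health check an operator really runs.

```python
from dataclasses import dataclass


@dataclass
class Zone:
    name: str      # hypothetical label, not a real AWS zone identifier
    healthy: bool  # stand-in for the result of a real health check


def route_weights(zones):
    """Split traffic evenly across healthy zones; unhealthy zones get none."""
    healthy = [z for z in zones if z.healthy]
    if not healthy:
        raise RuntimeError("no healthy zone available")
    share = 1.0 / len(healthy)
    return {z.name: (share if z.healthy else 0.0) for z in zones}


# One zone is down, so all traffic goes to the other.
zones = [Zone("zone-a", healthy=False), Zone("zone-b", healthy=True)]
weights = route_weights(zones)
print(weights)
```

The sketch only works if the healthy zone already holds a replica of the data and has application servers ready to take the load, which is why the two-zone setup mattered.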
Other websites were not so lucky. It is clear that several prominent services were not designed to withstand a single availability zone outage. "When we started with Amazon, our code was written with the assumption that there would be one data center," Reddit explained. "We have been working towards fixing this since we moved two years ago. Unfortunately, progress has been slow in this area."
There are multiple lessons here for Amazon and its clients. Most important, Amazon needs to do its best to ensure that such an outage doesn't happen again. It also needs to do a better job educating its customers about how to use Amazon Web Services in a way that ensures an outage will have the minimum possible impact. Amazon's tools do most of the work, and our personal experience recovering Trove is proof of that. We suffered the equivalent of a total data center outage but managed to be back up in five hours. All in all, we believe that's confirmation that we used our AWS resources well.
Small start-ups are not going to run away from the cloud or Amazon. The cost savings across the board and the ability to avoid capital commitments are just too enormous to sidestep. No venture capitalist is going to penalize a start-up for using Amazon, and no sophisticated investor will accept a start-up spending extra time and money to build out its own infrastructure when its business is unproven. Moreover, the next wave of start-ups should be able to learn from last week's outages, figuring out how to avoid some of the problems that befell other companies.
The larger industry question is how this affects established companies, those with more revenue and reputation on the line than the little start-ups. So long as Amazon is transparent about what caused the outages and how it plans to fix the problem, CIOs will realize that the company has come a long way in the last few years. Amazon's cloud-based database service RDS, for instance, has capabilities that are extremely expensive to acquire and administer—without it, Trove wouldn't have been able to recover nearly as quickly. If customers become savvier about what Amazon and its competitors have to offer, this increased knowledge will likely help Amazon more than hurt it.
Regardless of their size, companies will need to decide if the benefits of diversification outweigh the costs. They'll also have to think about whether their engineers would do a better job managing that diversification than Amazon. A smart executive might realize that, even though she could manage her infrastructure in-house, her employees would better serve company interests by working on the problems unique to their specific business. Amazon can only hope that's the case—the future of its cloud business depends on it.