In the first two weeks of this month, Amazon Web Services (AWS) suffered two outages: a larger, more widespread one on December 7 and a smaller, more localized one on December 15. Both disrupted a range of websites and online applications, including Google, Slack, Disney Plus, Amazon, Venmo, Tinder, iRobot, Coinbase, and The Washington Post. These services all rely on AWS for cloud computing. In fact, AWS is the leading cloud computing provider, ahead of other big players like Microsoft Azure, Google, IBM, and Alibaba.
To understand why the impact has been so great and what steps businesses can take to avoid these kinds of disruptions in the future, it makes sense to take a step back and examine what cloud computing is and what purpose it serves.
So what is cloud computing and AWS?
Whenever you connect to anything on the internet, your computer is basically just talking to another computer. A server is a type of computer capable of processing requests and delivering data to other computers on the same network or through the Internet.
But running your own server doesn’t come cheap. You have to buy the physical hardware, put it somewhere, and supply it with a lot of power. In many cases, it also needs an internet connection. Then, to ensure that data is sent and received with minimal delay, these servers must be physically close to their users.
In addition, you need to install software that has to be updated regularly. And you need to build failover mechanisms that switch operations to another server if a primary server malfunctions.
“What companies like Amazon have noticed is that a lot of [computing infrastructure] isn’t really specific to the service you’re performing,” says Justine Sherry, assistant professor at Carnegie Mellon University.
For example, the code running Netflix does something different than the code running a service like Venmo. The Netflix code streams videos to users and the Venmo code facilitates financial transactions. But underneath, most of the IT work is actually the same.
This is where cloud providers come in. They usually have hundreds to thousands of servers all over the country with good bandwidth. They offer to take care of tedious tasks like security, day-to-day management of data center operations, and scaling of services as needed.
“Then you can focus on your [specialized] code. Just write the part that makes the video work or the part that makes the financial transactions work. It’s easier, it’s cheaper because Amazon does it for many, many customers,” Sherry explains. “But there are also downsides, which are that everyone in the world depends on the same Costco-sized warehouses full of computers. There are dozens of them across the United States. But when one of them breaks down, it’s catastrophic.”
What went wrong with AWS on December 7 and 15
The AWS outages appear to have stemmed from errors in the automated systems that manage data flow behind the scenes.
AWS explained in a post that the December 7 error was caused by an issue with “an automated activity to scale the capacity of one of the AWS services hosted on the main AWS network,” which resulted in “a sharp increase in connection activity that overwhelmed the networking devices between the internal network and the main AWS network, resulting in communication delays between these networks.”
This auto-scaling capability allows the entire system to adjust the number of servers it uses based on the number of users on the network. “The idea is that if I have 100 users at 7 a.m. and then at noon everyone’s on Amazon during their lunch break and now I have 1,000 users, I need 10 times as many computers to interact with all of those customers,” says Sherry. “These frameworks automatically examine the extent of demand and can devote more servers to doing what’s needed when it’s needed.”
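Sherry’s arithmetic can be sketched in a few lines of Python. This is only an illustration of the general idea, not a description of how AWS’s own scaling systems work: the function name `servers_needed` and the figure of 50 users per server are assumptions made up for the example.

```python
import math


def servers_needed(active_users: int, users_per_server: int, min_servers: int = 1) -> int:
    """Return how many servers a simple auto-scaler would request.

    users_per_server is a hypothetical capacity figure chosen for illustration.
    """
    if users_per_server <= 0:
        raise ValueError("users_per_server must be positive")
    # Round up so a partially loaded server still counts as a whole server.
    return max(min_servers, math.ceil(active_users / users_per_server))


# Sherry's example: 100 users at 7 a.m., 1,000 users at noon.
morning = servers_needed(100, users_per_server=50)
noon = servers_needed(1000, users_per_server=50)
print(morning, noon)
```

With ten times the users, the sketch asks for ten times the servers, which is the behavior Sherry describes. A real system would also have to decide how quickly to add and remove machines.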
Later, on December 15, a status update released by AWS said the outage was due to “traffic engineering” improperly moving “more traffic than expected to parts of the AWS backbone that affected the connectivity to a subset of Internet destinations.”
Large data centers have many internet connections through different internet service providers. They can choose where online traffic is routed, whether over a cable through AT&T or a different cable through Sprint.
Their automatic “traffic engineering” decides to reroute traffic based on a number of conditions. “Most providers will redirect traffic primarily based on load. They want to make sure things are relatively balanced,” Sherry says. “Looks like auto-adaptation failed on the 15th and they ended up carrying too much traffic on one connection. You can literally think of it as a pipe that has had too much water and water is coming out of the seams. This data ends up being dropped and disappears.”
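The pipe analogy can be made concrete with a toy model. The sketch below is an assumption-laden illustration, not AWS’s traffic engineering: the link names, the capacities, and the functions `balance_by_capacity` and `dropped_traffic` are all hypothetical.

```python
from typing import Dict


def balance_by_capacity(total_load: float, capacity_by_link: Dict[str, float]) -> Dict[str, float]:
    """Split traffic across links in proportion to each link's capacity."""
    total_capacity = sum(capacity_by_link.values())
    return {
        link: total_load * capacity / total_capacity
        for link, capacity in capacity_by_link.items()
    }


def dropped_traffic(load_by_link: Dict[str, float], capacity_by_link: Dict[str, float]) -> Dict[str, float]:
    """Return the traffic each link cannot carry; the overflow is simply lost."""
    return {
        link: max(0.0, load - capacity_by_link[link])
        for link, load in load_by_link.items()
    }


# Two hypothetical links, each able to carry 100 units of traffic.
capacities = {"link_a": 100.0, "link_b": 100.0}

balanced = balance_by_capacity(150.0, capacities)
print(dropped_traffic(balanced, capacities))  # nothing overflows

# The same 150 units, but routed badly onto one link.
skewed = {"link_a": 140.0, "link_b": 10.0}
print(dropped_traffic(skewed, capacities))  # link_a overflows
```

The total traffic is identical in both cases. Only the routing decision changes, which is why a fault in the balancing logic alone can cause data loss even when there is enough capacity overall.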
Despite a few outages over the past few years, Sherry says AWS is “pretty good at running its infrastructure.” By nature, it is very difficult to design perfect algorithms that can anticipate every problem, and bugs are an annoying but regular part of software development. “The only thing that is unique about the cloud situation is the impact.”
A growing number of independent companies are turning to centralized third-party services like AWS for cloud infrastructure, storage, and more.
“If I pay Amazon to run a data center for me, store my files, and serve my customers… they’ll do a better job than I would as an academic administrator or a small-business administrator,” Sherry says. “But from a societal point of view, when all these small individual players decide to outsource to the cloud, we end up with a very large centralized dependency.”
Back to basics?
While AWS was down, Sherry couldn’t control her television. Normally she uses her phone as a remote control. But the phone does not speak directly to the TV. Instead, the phone and the TV both talk to a server in the cloud, and that server relays commands between them. The cloud is essential for some functions, such as downloading automatic software updates. But when it comes to flipping through the channels available from an antenna or satellite, “there’s no reason for that to happen,” she says, referring to the round trip through the cloud. “We’re in the same room, we’re on the same wireless network, all I’m trying to do is change the channel.” In short, the cloud can offer practical technology solutions in some cases, but not in all.
Of the accounts of disrupted technology, the one that struck her most as an unnecessarily convoluted design was a timed cat feeder that had to go through the cloud. Automated cat feeders existed long before the cloud. They are essentially a feeder coupled with an alarm clock. “But for some reason, someone decided that instead of building the alarm clock part in the cat feeder, they were going to put the alarm clock part in the cloud, and have the cat feeder go to the internet and ask the cloud, ‘Is it time to feed the cat?’” says Sherry. “There’s no reason it needs to be put in the cloud.”
Going forward, she thinks app developers should take a look at every feature destined for the cloud and consider whether it can work without the cloud, or at least have an offline mode so that it isn’t completely debilitated during an internet failure, a data center outage, or even a power outage.
“There are other things that probably won’t work. You probably won’t be able to log into your online banking if you can’t access the bank’s server,” says Sherry. “But so many things that failed are things that really shouldn’t have failed.”
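The kind of offline mode Sherry has in mind for the cat feeder can be sketched as a cloud lookup with a local fallback. This is a minimal illustration under stated assumptions, not the design of any real product: `LOCAL_SCHEDULE`, `get_feeding_times`, `should_feed`, and the `fetch_from_cloud` callback are all hypothetical names.

```python
from datetime import time
from typing import Callable, List

# Hypothetical schedule stored on the device itself: the built-in "alarm clock."
LOCAL_SCHEDULE: List[time] = [time(7, 0), time(18, 0)]


def get_feeding_times(
    fetch_from_cloud: Callable[[], List[time]],
    local_schedule: List[time] = LOCAL_SCHEDULE,
) -> List[time]:
    """Prefer the cloud schedule, but fall back to the local copy on failure."""
    try:
        return fetch_from_cloud()
    except (ConnectionError, TimeoutError):
        # The cloud is unreachable, so keep working from the on-device schedule.
        return list(local_schedule)


def should_feed(now: time, feeding_times: List[time]) -> bool:
    """Return True if the current hour and minute match a scheduled feeding."""
    return any(now.hour == t.hour and now.minute == t.minute for t in feeding_times)


def cloud_is_down() -> List[time]:
    """Stand-in for a cloud request made during an outage."""
    raise ConnectionError("cloud unreachable")


times = get_feeding_times(cloud_is_down)
print(should_feed(time(7, 0), times))
```

The cloud still adds convenience when it is available, such as changing the schedule from a phone, but the cat gets fed either way.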