Are We Breaking The Internet?
Outages at critical services have caused massive disruption across the net in recent weeks. Whether it was Amazon’s S3 service failure, which took down thousands of sites; Cloudflare’s “Cloudbleed” security issue, which forced many sites to ask users to reset their passwords; or Google Wifi’s accidental reset, which wiped out customers’ internet profiles, the infrastructure behind the internet has looked substantially less stable of late.
The packet-switching technology that underlies most of the internet was pioneered by Paul Baran as part of an effort to protect communications by moving from a centralized model of communication to a distributed one. While the Internet Society questions whether the creation of the internet was a direct response to concerns about a nuclear threat, it clearly agrees that “later work on Internetting did emphasize robustness and survivability, including the capability to withstand losses of large portions of the underlying networks.”
From there, the foundation was laid for an internet that treated the distributed model as key to ensuring reliability. Almost 50 years later, consolidation in hosting and mobile and the development of the cloud have created a model that concentrates much of the web among a few key players: Amazon, Microsoft, and Google now host a large number of sites. Many of those companies’ customers have opted to host their infrastructure in a single set of data centers, potentially making the web more fragile by re-centralizing large portions of the net.
That’s what happened when Amazon’s S3 service, essentially a large hard drive used by companies like Spotify, Pinterest, Dropbox, Trello, Quora, and many others, lost one of its data centers on Tuesday morning. The problem began around 9:37 a.m. Pacific, the company later explained, after an employee tried to fix a problem with S3’s billing system: “an authorized S3 team member using an established playbook executed a command which was intended to remove a small number of servers… Unfortunately, one of the inputs to the command was entered incorrectly and a larger set of servers was removed than intended.”
Sites whose content was stored on those servers, located in Northern Virginia, essentially stopped functioning properly, prompting experts to recommend that companies store data across multiple data centers to increase reliability. The failure rippled across Amazon’s other services, many of which depend upon S3, leading to “increased error rates” for sites that rely on AWS and making engineers’ recovery efforts that much more difficult. Even the webpage Amazon uses to alert customers to outages was affected.
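To make the experts' recommendation concrete, here is a minimal sketch of what reading from more than one data center might look like. It is an illustration only, not Amazon's API or any particular company's setup: the region labels, the `RegionUnavailable` error, and the `make_region` and `fetch_with_failover` helpers are all hypothetical, and the "regions" are simulated in memory rather than backed by a real storage service.

```python
class RegionUnavailable(Exception):
    """Hypothetical error raised when a region cannot serve a request."""


def make_region(store, healthy=True):
    """Build a fake region client backed by an in-memory dict (illustration only)."""
    def fetch(key):
        if not healthy:
            raise RegionUnavailable("region is down")
        return store[key]
    return fetch


def fetch_with_failover(key, regions):
    """Try each (name, fetch) pair in order; return (name, data) from the first that answers."""
    errors = {}
    for name, fetch in regions:
        try:
            return name, fetch(key)
        except RegionUnavailable as exc:
            # Remember why this region failed, then fall through to the next one.
            errors[name] = str(exc)
    raise RegionUnavailable(f"all regions failed for {key!r}: {errors}")


# The same object is assumed to have been copied to both regions ahead of time.
replicas = {"logo.png": b"image-bytes"}
regions = [
    ("us-east-1", make_region(replicas, healthy=False)),  # simulated outage
    ("us-west-2", make_region(replicas, healthy=True)),
]

region, data = fetch_with_failover("logo.png", regions)
print(region, data)  # us-west-2 b'image-bytes'
```

The sketch only helps if the data was replicated to the second region before the outage; a site that keeps a single copy in one region has nothing to fail over to.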