Yesterday’s Facebook outage – which took down Facebook Messenger, Instagram and WhatsApp along with the main service – was the result of a mistake by the company’s own network engineers.
The error made all of Facebook’s services inaccessible, and one analogy compares it to a failure of “air traffic control” for network traffic …
We reported on the massive failure yesterday.
It’s not just you: Facebook, Instagram, and WhatsApp are all down for users around the world right now. We are seeing error messages on all three services in iOS apps as well as on the web. Users are greeted with error messages such as: “Sorry, something went wrong”, “Server error 5xx”, and more.
The outage affects all platforms owned by Facebook, according to data from Downdetector and Twitter. This includes Instagram, Facebook, WhatsApp and Facebook Messenger […] While some Facebook, Instagram, and WhatsApp outages only affect certain geographic regions, services are currently down worldwide.
It gradually became clear that the problem could be with DNS – the Domain Name System servers that tell devices which IP addresses to use to access services – but it was not clear exactly what had happened, or whether it was an outside hack, malicious insider action, or catastrophic error.
Facebook has now admitted in a blog post that this was a mistake.
Our engineering teams have learned that configuration changes on the backbone routers that coordinate network traffic between our data centers caused issues that interrupted that communication. This disruption in network traffic has had a cascading effect on the way our data centers communicate, causing our services to stop.
It took a long time to resolve the issue as the inaccessible systems included the servers and tools that engineers would normally use to resolve the issue remotely. Reports suggest lower-level employees needed to physically access data centers and then rely on step-by-step instructions from more experienced engineers in order to fix the error. To complicate matters, the unavailability of networks meant that Facebook’s door access systems were also offline, physically preventing access.
How to understand the Facebook crash
We’ll probably get the whole story over time, but the emerging consensus is that the problem was a mix of Domain Name System (DNS) and Border Gateway Protocol (BGP) configuration.
The best analogy I’ve seen is to think of network traffic as airplanes. Your device wants to fly to facebook.com. Your aircraft must first know the GPS coordinates of the destination airport, i.e. the IP address to which it must connect. It obtains this information by asking a DNS server, which tells it that facebook.com is located at (for example) 18.104.22.168.
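To make the "GPS coordinates" step concrete, here is a minimal Python sketch of a name lookup using the standard library's `socket.getaddrinfo`. The helper name `lookup_addresses` is made up for illustration, and the sketch simply assumes that returning an empty list is a reasonable way to signal a failed lookup.

```python
import socket


def lookup_addresses(hostname):
    """Return the sorted, de-duplicated IP addresses the system resolver
    reports for hostname, or an empty list if the lookup fails."""
    try:
        # Restricting to TCP avoids one duplicate entry per socket type.
        results = socket.getaddrinfo(hostname, None, proto=socket.IPPROTO_TCP)
    except socket.gaierror:
        # The name could not be resolved, e.g. because no DNS server answered.
        return []
    # Each result is (family, type, proto, canonname, sockaddr);
    # sockaddr[0] is the IP address as a string.
    return sorted({info[4][0] for info in results})
```

Calling `lookup_addresses("facebook.com")` on a connected machine should return whatever addresses the resolver currently hands out. During an outage like this one, a lookup of that kind would be expected to come back empty.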
But getting to the final destination – the actual server that can do the task you want to do – relies on some kind of air traffic control system for network traffic, and that’s BGP. BGP tells the routers carrying your traffic which path to take through the various networks en route to your final destination.
It looks like Facebook’s BGP routes disappeared entirely, so there was no way to tell traffic how to reach Facebook’s servers. That included traffic from Facebook’s own engineers trying to reach the systems they needed to fix the error.
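The interplay between the two systems can be sketched as a toy model in Python. This is not how Facebook’s network is built. The class `RoutingTable`, the function `resolve_name`, the name `social.example`, the documentation-range addresses, and the label `AS64500` are all assumptions made up for illustration. The point is only that once the routes to the DNS servers are gone, the name can no longer be resolved, even though the servers themselves may be running.

```python
import ipaddress


class RoutingTable:
    """Toy stand-in for the routes a network has learned via BGP."""

    def __init__(self):
        self.routes = {}  # maps an IP prefix to a next-hop label

    def announce(self, prefix, next_hop):
        self.routes[ipaddress.ip_network(prefix)] = next_hop

    def withdraw(self, prefix):
        self.routes.pop(ipaddress.ip_network(prefix), None)

    def lookup(self, address):
        """Return the next hop for the most specific matching prefix, or None."""
        addr = ipaddress.ip_address(address)
        matches = [net for net in self.routes if addr in net]
        if not matches:
            return None  # no route: the destination is unreachable
        best = max(matches, key=lambda net: net.prefixlen)
        return self.routes[best]


def resolve_name(name, dns_servers, zone, table):
    """Answer from the zone only if some DNS server is reachable."""
    for server in dns_servers:
        if table.lookup(server) is not None:
            return zone.get(name)
    return None  # every DNS server is unreachable, so the lookup fails


# Hypothetical setup: one prefix for DNS servers, one for web servers.
table = RoutingTable()
table.announce("192.0.2.0/24", "AS64500")
table.announce("198.51.100.0/24", "AS64500")
dns_servers = ["192.0.2.53"]
zone = {"social.example": "198.51.100.10"}

before = resolve_name("social.example", dns_servers, zone, table)

# A bad configuration change withdraws both prefixes.
table.withdraw("192.0.2.0/24")
table.withdraw("198.51.100.0/24")

after = resolve_name("social.example", dns_servers, zone, table)
print(before, after)
```

In this model the lookup succeeds before the withdrawal and fails after it, which matches the symptom users saw: to the outside world, the domain simply stopped existing.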
The blackout has huge implications
If this were just people who couldn’t post cat videos for a few hours, that would be one thing (but, come on, what’s life without cat videos?). But WhatsApp is an essential part of the communications infrastructure in many countries, commonly used for communication between patients and doctors, for example, and used by many for payments.
The extended outage has drawn attention to the vulnerability of the entire world to failures of this nature.
For example, millions of people rely on Google’s DNS servers to reach all the servers on the planet. Imagine that these servers go down for an extended period of time. This would not only affect consumers, it would disrupt commerce and critical infrastructure. Factory production, fleet transport, retail sales… the works.
The entire world is critically dependent on a relatively small number of servers, all of which could be taken offline by an error like the one that has occurred here. There is a lot of thought to be given to how we prevent a much bigger internet blackout in the future.