What can we learn from Internet outages?
Now that we have had a couple of weeks to look back on the great Facebook Outage of 2021, and it’s not looking like much more detail will be forthcoming, let’s see if we can learn anything from what has been made public about this and some other recent outages. It was nice that the Register decided to retweet our article on the perils of Internet Centralization since that saved us from trying to claim we had brilliant foresight. But is there more to this story than confirming what we already believed?
One of the networking talks that left a strong impression on me was given by Najam Ahmad, VP of Network Engineering at Facebook, in 2015 at the second P4 workshop. The takeaway that I have retained from that talk and passed along frequently was that software can do a better job of running networks than humans. Not that software is perfect, but using software to automate network operations is a way to avoid the configuration errors that humans are way too likely to make. Facebook was at the forefront of network automation, and it’s a much more mature field now than in 2015. So when we heard that Facebook had managed to disrupt a critical piece of networking infrastructure through configuration error, I was pretty surprised. Nothing that I know about the way Facebook operates made me think they would be likely to push out an untested configuration. While the motto “move fast and break things” lives on in the popular image of Facebook, it was actually back in 2014 that Facebook embraced a new motto: “Move fast with stable infrastructure”. (It doesn’t have the same ring to it, as WIRED magazine noted with admirable understatement.)
What we know at this point, thanks to the blogs that came out from Facebook engineering, is that a configuration change, intended only to “assess the availability of global backbone capacity”, was run through an “auditing tool”, allowed to go ahead, and subsequently led to most of Facebook’s DNS servers becoming unreachable. From outside Facebook, this was observed as BGP routes being withdrawn and DNS resolution requests failing. It appears there was at least one automated step: “our DNS servers disable[d] those BGP advertisements” when they lost connectivity to the backbone. These failures led to a cascading series of problems, including Facebook employees being unable to gain physical access to the machines that needed to be reconfigured, since the physical security system apparently depended on a working network. (Facebook denied rumors that angle grinders were used to gain physical access to servers.)
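Facebook’s write-up doesn’t describe the actual mechanism, but the automated step it mentions amounts to a health-check loop along the following lines. This is a minimal, hypothetical sketch: the prefixes are RFC 5737 documentation ranges and the functions are placeholders, not Facebook code. What it illustrates is that when the health check itself is what breaks (because the backbone is down everywhere), every DNS server withdraws its anycast routes at once.

```python
import time

# Hypothetical sketch only: these prefixes (RFC 5737 documentation ranges)
# and function names do not correspond to any real Facebook system.
ANYCAST_PREFIXES = ["192.0.2.0/24", "198.51.100.0/24"]


def backbone_is_healthy() -> bool:
    """Placeholder for a site-specific check: can this edge node reach
    the data centers over the backbone?"""
    ...


def announce(prefix: str) -> None:
    """Placeholder: tell the local BGP speaker to advertise this prefix."""
    ...


def withdraw(prefix: str) -> None:
    """Placeholder: tell the local BGP speaker to withdraw this prefix."""
    ...


def control_loop(poll_interval: float = 5.0) -> None:
    advertised = True
    while True:
        if advertised and not backbone_is_healthy():
            # The automated step described in the post-mortem: a failed
            # backbone health check withdraws the anycast routes, making
            # this DNS server unreachable from the outside.
            for prefix in ANYCAST_PREFIXES:
                withdraw(prefix)
            advertised = False
        elif not advertised and backbone_is_healthy():
            for prefix in ANYCAST_PREFIXES:
                announce(prefix)
            advertised = True
        time.sleep(poll_interval)
```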
There is certainly something to be learned here about the way distributed systems can exhibit central points of failure in spite of their designers’ best intentions. DNS is a distributed system that should tolerate all sorts of failures, and BGP is the routing protocol on which the Internet depends, with lots of capabilities to route around failures. Yet a configuration error managed to take enough of Facebook’s DNS offline to disable most Facebook services, both external (including WhatsApp and Instagram) and internal (e.g., physical security).
Building distributed systems that tolerate any possible failure is hard. When I worked at Nicira, our SDN product was a highly fault-tolerant, distributed system that was designed to gracefully handle the failure of any component, including a complete loss of the control plane. Yet we still managed to hit occasional corner cases that would take the system down. Similarly, Facebook apparently did have a backup path for communicating with their servers, but according to their blog “Our primary and out-of-band network access was down”. So it’s not that their design was flagrantly brittle (or at least, we don’t have evidence that it was), but it failed in a way that was unanticipated.
The piece of the story that really caught my attention was the mention of an “auditing tool” that allowed the offending configuration change to go ahead. I was immediately reminded of the phrase “Who watches the watchmen?” (which I’m pretty sure I learned in Latin at school). Who audits the auditors?
It seems that this is an area in which Facebook, in retrospect, could have made more investment. As we noted previously, there is a rich and rapidly growing set of tools to perform network verification. For example, you can find lots of details on how to use Batfish to test BGP configuration changes before deploying them, and it’s straightforward to test assertions like “will the following prefixes be reachable after this change”. It’s not that Facebook wasn’t trying to tackle this problem, but it seems that their in-house auditing tool wasn’t robust enough to catch a really serious configuration error before it was pushed out.
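To make that concrete, a pre-deployment check using Batfish’s Python client (pybatfish) might look roughly like this. It’s a sketch under assumptions: a Batfish service running locally, candidate configs in a placeholder directory, and illustrative documentation prefixes standing in for whatever ranges actually matter (say, the ones covering your authoritative DNS servers).

```python
from pybatfish.client.session import Session

# Rough sketch of a pre-deployment Batfish check. The snapshot path,
# network name, and prefixes below are placeholders for illustration.
bf = Session(host="localhost")          # assumes a local Batfish service
bf.set_network("backbone")
bf.init_snapshot("configs/candidate_change", name="candidate", overwrite=True)

# Ask Batfish what routes the candidate configuration would produce.
routes = bf.q.routes().answer().frame()

# Assert that the prefixes we care about would still be routed after the change.
must_remain = {"192.0.2.0/24", "198.51.100.0/24"}
present = {str(net) for net in routes["Network"]}
missing = must_remain - present
if missing:
    raise SystemExit(f"Change would remove routes for: {sorted(missing)}")
print("All required prefixes still routed; change looks safe to deploy.")
```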
As if to remind everyone that BGP configuration changes should be tested before being implemented, Telia managed to get in on the act with this outage a few days later, on October 7, 2021. We know less about what happened in that case, but it again makes the case for testing configuration changes before pushing them to production.
What to take away from all of this? It seems most people have been able to confirm their existing beliefs, whether it’s the perennial role of DNS in causing outages, the difficulty of getting BGP to behave, or the wisdom of separating the control and data planes. And the fact that so much of the world depends on WhatsApp does back up our earlier point about excessive centralization.
In my case, I’m refining my views on network automation. While it can reduce opportunities for human error, it can also compound errors, as seems to have happened here. I am, however, more convinced than ever that we need to do a better job of verifying networks, and that includes verifying the behavior of automated systems. CI/CD with automated testing, which is commonplace for software systems, is still too rare for networks. Sure, the tools we have may not be perfect, but they provide a layer of robustness that is all too often lacking. Just as it’s nearly impossible for a single “pilot error” to bring down a commercial aircraft because of all the safety checks and systems in place, we should treat networks as software systems that can be verified before we make destructive changes. And the networks of the future can be better designed to make them verifiable.
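To illustrate what that might look like in practice, a check like the Batfish query sketched above can be wrapped in an ordinary test and run by the pipeline on every proposed change, so no human has to remember to run it. Again, this is only a sketch, with hypothetical paths and placeholder prefixes:

```python
# test_candidate_change.py -- hypothetical CI gate for network config changes,
# run (e.g., via pytest) before anything is pushed to production routers.
from pybatfish.client.session import Session

REQUIRED_PREFIXES = {"192.0.2.0/24", "198.51.100.0/24"}  # placeholder ranges


def test_required_prefixes_survive_change():
    bf = Session(host="localhost")  # assumes a Batfish service reachable from CI
    bf.set_network("backbone")
    bf.init_snapshot("configs/candidate_change", name="candidate", overwrite=True)

    routed = {str(net) for net in bf.q.routes().answer().frame()["Network"]}
    missing = REQUIRED_PREFIXES - routed
    assert not missing, f"Candidate change drops routes for {sorted(missing)}"
```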
You can find our earlier article on Network Verification at the Register. Our newest book on “Operationalizing an Edge Cloud” has reached 0.1 status and you can, as always, suggest improvements via GitHub. Our previous TCP congestion control post was our most popular to date, underscoring our point that congestion control is a perennially interesting topic.
Disclosure: Bruce is an advisor for Intentionet, which provides commercial support for Batfish.