Facebook explains how its historic outage happened and how it was fixed

Monday's worldwide Facebook outage marked a before and after for the company: its services were completely disconnected from the Internet for more than five hours, something unprecedented for one of the largest companies in the world. Now that the Facebook platform, WhatsApp and Instagram have fully recovered from Monday's incident, the Facebook team has published details about how the failure happened, why it happened, and how they managed to fix it. Want to know all the details about the biggest outage in Facebook's history so far?


How does Facebook's network work and why did the total outage occur?

Facebook has indicated that the total worldwide service interruption was caused by a failure in the system that manages the capacity of the company's backbone. This backbone is the core network that connects all of Facebook's data centers around the world, and it consists of thousands of servers and hundreds of kilometers of fiber optics, including submarine cables linking data centers across continents. Some Facebook data centers house millions of servers that store data and carry a heavy computational load, while other, smaller facilities are responsible for connecting the backbone to the wider Internet so that people can use Facebook's platforms.

When a user connects to Facebook or Instagram, the data request travels from their device to the geographically closest facility, which then communicates over the backbone with the larger data centers; that is where the requested information is retrieved and processed so that we can see it on our smartphones.
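As a rough illustration of that flow, here is a minimal Python sketch of how a request might be steered to the nearest edge facility and then forwarded over the backbone to a core data center. The facility names, distances and the single backbone flag are invented for the example; this is not Facebook's real topology.

```python
# Toy model of the edge -> backbone -> core request path described above.
# Facility names and distances are hypothetical, purely for illustration.
EDGE_FACILITIES = {"edge-eu-west": 120, "edge-us-east": 6200}  # km from the user
BACKBONE_UP = True
CORE_DATA_CENTER = "core-dc-1"

def handle_request() -> str:
    # 1. The request lands on the geographically closest edge facility.
    nearest = min(EDGE_FACILITIES, key=EDGE_FACILITIES.get)
    # 2. That facility forwards the request over the backbone to a core
    #    data center, where the data is retrieved and processed.
    if not BACKBONE_UP:
        raise ConnectionError(f"{nearest}: backbone link to core is down")
    return f"{nearest} -> backbone -> {CORE_DATA_CENTER} -> response"

print(handle_request())
```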

All data traffic between the different data centers is managed by routers, which determine where incoming and outgoing data should be sent. As part of their day-to-day work, Facebook's engineering team needs to maintain this infrastructure and perform tasks such as upgrading routers, repairing fiber lines, or adding capacity on certain links. This is where Monday's global Facebook outage began.

During one of these maintenance jobs, a command was issued with the intention of assessing the availability of global backbone capacity, but it unintentionally took down all of the backbone connections, disconnecting every Facebook data center worldwide. Facebook normally uses systems to audit these kinds of commands and to mitigate or prevent mistakes like this one, but a bug in that audit and change-control tool kept it from blocking the command, and at that point everything fell apart.
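Facebook has not published the tool or the command involved, but the general idea of an audit step that should have rejected a dangerous change, yet let it through because of a bug, can be sketched like this (the rule, the bug and the command are all hypothetical):

```python
# Hypothetical sketch of a change-control check for backbone commands.
# The rule, the bug and the command below are illustrative, not Facebook's code.

def is_safe(command: str, links_affected: int, total_links: int) -> bool:
    """Audit step: reject changes that would take down too much of the backbone."""
    if "drain" not in command:
        return True  # not a capacity-affecting change, let it through
    # Intended rule: never allow more than half of the backbone to be drained.
    # Bug: the threshold is computed wrong, so even "drain everything" passes.
    return links_affected < total_links * 2   # should have been total_links // 2

command = "drain backbone links for capacity assessment"
if is_safe(command, links_affected=100, total_links=100):
    print("audit passed -> command executed -> all backbone links went down")
else:
    print("audit blocked the command")
```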

What happened at Facebook when the command was executed?

As soon as the command was executed, it completely cut the connections between the data centers and the Internet, which means none of Facebook's services could be reached because they were simply no longer visible on the Internet. This total disconnection then triggered a second, catastrophic failure, specifically in DNS. One of the tasks the smaller facilities perform is answering DNS queries; those queries are answered by authoritative name servers with well-known IP addresses, which are advertised to the rest of the Internet using the BGP protocol.
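To see why unreachable authoritative servers make an entire domain disappear, here is a minimal, purely illustrative resolution sketch; the server names, IP addresses and the "unreachable" set are invented for the example and do not correspond to Facebook's real infrastructure:

```python
# Toy resolver walk, illustrating why a domain vanishes when its
# authoritative name servers cannot be reached. All data is invented.

AUTHORITATIVE_NS = {          # "well-known" IPs advertised via BGP
    "a.ns.example-social.com": "192.0.2.10",
    "b.ns.example-social.com": "192.0.2.11",
}
UNREACHABLE = set(AUTHORITATIVE_NS.values())  # their routes were withdrawn

def resolve(hostname: str) -> str:
    # The TLD servers hand the resolver the authoritative servers' IPs,
    # but none of those IPs is routable on the Internet any more.
    for ns, ip in AUTHORITATIVE_NS.items():
        if ip not in UNREACHABLE:
            return f"{hostname} resolved by {ns}"
    raise RuntimeError(f"SERVFAIL: no authoritative server reachable for {hostname}")

try:
    resolve("www.example-social.com")
except RuntimeError as err:
    print(err)   # roughly what every resolver on the Internet saw on Monday
```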

To make operation more reliable, Facebook's DNS servers withdraw those BGP advertisements themselves if they cannot talk to Facebook's data centers, because that normally indicates the network connection is not working properly. With the backbone completely down, that is exactly what the DNS servers did: they withdrew their BGP advertisements. As a result, Facebook's DNS servers became unreachable even though they were working perfectly, and that is why the rest of the world could not reach any Facebook service.
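The health-check behavior Facebook describes can be summarized in a few lines; the data center names, prefixes and reachability function here are assumptions made for illustration, not Facebook's actual implementation:

```python
# Illustrative health check on a DNS edge node: if the node cannot reach any
# data center over the backbone, it stops advertising its DNS prefixes via BGP.
# Data centers, prefixes and the reachability check are hypothetical.

DNS_PREFIXES = ["198.51.100.0/24"]          # prefixes normally announced via BGP
DATA_CENTERS = ["dc-1", "dc-2", "dc-3"]

def backbone_reachable(dc: str) -> bool:
    return False   # on Monday, every backbone path was down

def health_check() -> list[str]:
    """Return the BGP announcements this node should keep advertising."""
    if any(backbone_reachable(dc) for dc in DATA_CENTERS):
        return DNS_PREFIXES                  # healthy: keep announcing
    # Unhealthy: withdraw the routes so resolvers stop being sent here.
    # Side effect during the outage: the DNS servers themselves were fine,
    # but nobody on the Internet could reach them any more.
    return []

print("announced prefixes:", health_check())   # -> announced prefixes: []
```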

Logically, this whole process took only a matter of seconds. While Facebook's engineers tried to figure out what was happening and why, they faced two critical problems:

  • It was not possible to access the data centers by the usual means, because the networks were completely down from the first failure.
  • The DNS outage broke many of the internal tools that are normally used to investigate and fix problems of this kind.

Access to both the main network and the out-of-band network was down and nothing was working, so they had to physically send a team of engineers to the data centers to fix the problem and restart the systems. This took a long time because physical security at these centers is extremely tight; in fact, as Facebook has confirmed, it is deliberately difficult even for their own staff to access them physically and make changes, precisely to avoid or mitigate physical attacks on the network. It therefore took quite a while before they could authenticate against the systems and see what was going on.

Coming back to life… but little by little, so as not to bring the whole system down again

Once backbone connectivity was restored across the different regions of Facebook's data centers, everything was working again, but not yet for users. To keep their systems from collapsing under the huge number of users trying to get back in, they had to bring services back very gradually, to avoid causing new problems from the sudden surge in traffic.
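Facebook has not detailed exactly how it phased traffic back in, but a gradual ramp-up of this kind is often implemented as admitting a growing fraction of requests while checking system health between steps; the percentages, regions and health check below are invented for the sketch:

```python
# Hypothetical staged ramp-up: admit a growing fraction of traffic,
# checking system health between steps. All values are invented.
import random

RAMP_STEPS = [0.05, 0.15, 0.40, 0.70, 1.00]   # fraction of traffic admitted

def system_healthy() -> bool:
    # Stand-in for real load, error-rate and power checks.
    return random.random() > 0.02

def admit(request_id: int, fraction: float) -> bool:
    """Deterministically admit roughly `fraction` of requests."""
    return (request_id % 100) < fraction * 100

for fraction in RAMP_STEPS:
    if not system_healthy():
        print(f"backing off at {fraction:.0%}, waiting before the next step")
        break
    admitted = sum(admit(i, fraction) for i in range(10_000))
    print(f"{fraction:.0%} step: serving ~{admitted} of 10,000 requests")
```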

One of the problems was that the individual data centers were drawing very little electrical power; suddenly sending all the traffic back could leave the electrical grid unable to absorb that much additional load and put the electrical systems at risk as well. Facebook has trained for this type of event, so they knew exactly what to do to avoid further problems during a global outage like this one. However, although Facebook had simulated many failures of its servers and networks, it had never considered a total loss of the backbone, so the company has already stated that it will look for a way to simulate this scenario in the near future, to prevent it from happening again and taking so long to fix.

Facebook has also noted that it was very interesting to see how the physical security measures designed to prevent unauthorized access slowed recovery enormously as they tried to restore service globally. Even so, it is better to be protected from this kind of problem every day and accept a somewhat slower recovery than to relax the security measures of the data centers.