Re: Facebook DNS issue

On Tue, Oct 5, 2021 at 1:27 AM Theodore Ts'o <tytso@xxxxxxx> wrote:
On Mon, Oct 04, 2021 at 08:11:25PM -0700, Matt Joras wrote:
> I will hop in here one more time. It was not a botched BGP update. The DNS
> disappearance was an unfortunate but preventable side effect of a global
> backbone issue. Had DNS been functioning everything still would have been
> down: https://engineering.fb.com/2021/10/04/networking-traffic/outage/

It was a botched BGP configuration issue, and when BGP advertisements
were updated across the global backbone, it caused large portions of
Facebook's network to not be reachable.  The fact that this brought
down Facebook's DNS appears to have significantly increased its
downtime, since Facebook's engineers couldn't authenticate to the
servers needed to fix the problem.

There are certainly a large number of operational questions which this
brings up --- why didn't Facebook have their own internal backbone
networks, with their own internal split-view DNS?  Why didn't they have
ways for their SREs to get direct access to some of the servers in
their data centers which didn't depend on access via the public or
internal Internet backbone networks?  These are, however, out of scope
of the IETF, because it has to do with how a particular site
configures its networks.

What Ted says above is entirely consistent with what Cloudflare and others observed from BGP route advertisements.

But I do dispute the notion that this is not an issue for IETF.  Split horizon DNS has been a common and necessary DNS configuration for 30+ years now. But there is no support for split horizon in the DNS protocol and that has consequences.
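
To make that concrete, here is a minimal sketch of what a split horizon configuration amounts to in practice. The view names, networks, hostname and addresses are all made up for illustration, and this is not how any particular server implements it. The point is that the answer is typically chosen from the client's source address, which is server-side policy rather than anything expressed in the DNS query itself.

```python
import ipaddress

# Hypothetical views, checked in order. Names, networks and records are
# illustrative assumptions, not taken from any real deployment.
VIEWS = [
    ("internal", ipaddress.ip_network("10.0.0.0/8"),
     {"tools.example.com": "10.1.2.3"}),
    ("external", ipaddress.ip_network("0.0.0.0/0"),
     {"tools.example.com": "192.0.2.10"}),
]


def resolve(name, client_ip):
    """Return (view_name, address) for the first view matching client_ip.

    The view is selected only from the client's source address; nothing
    in the query tells the server which view the client expects.
    """
    addr = ipaddress.ip_address(client_ip)
    for view_name, network, records in VIEWS:
        if addr in network:
            return view_name, records.get(name)
    return None, None


if __name__ == "__main__":
    print(resolve("tools.example.com", "10.5.5.5"))
    print(resolve("tools.example.com", "198.51.100.7"))
```

The same name yields different answers depending on where the query comes from, and a client has no in-protocol way to learn which view it was given.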

Or to look at the situation differently: Does the IETF want to be a part of the solution to this particular set of problems?

If this were a unique and unprecedented incident, we could safely ignore it. But it isn't. It is merely an incident that happened to have particularly widespread effects.


Availability is a security issue.

30 years ago, this organization was built around the notion that the people in it had a better idea of how to do communications than the telephone companies. And 30 years later, that is still where most people seem to be stuck.

We should be looking at how people are actually using the Internet in practice and working out ways to make those uses work better.
