Greetings.

As everyone likely knows, we had a long outage of our primary datacenter (PHX2) yesterday. It started around 14:45 UTC and lasted until around 06:45 UTC Saturday morning. The outage was networking/routing related, but we don't know any further details. I'm happy to share what we know and can share once investigations are completed. The purpose of this email, however, is to talk about what we might be able to easily adjust to handle things like this moving forward. Happily these sorts of outages are very rare, but there are a few things I think we can do to make them a bit easier on us.

First, let's talk about things that went well:

* The mirrorlist containers worked great, so mirrorlist impact was low (but not nothing, see below).
* Our distributed proxy system meant things like getfedora.org and other static websites were fine.
* Our mirror network + metalinks/mirrorlists meant that people could still get updates, etc. just fine.
* status.fedoraproject.org was just great and even survived slashdotting (well, that's not all it used to be :)

And some things could have been better:

* We couldn't take proxy01/10 (the two proxies at PHX2) out of DNS because we didn't have the DNSSEC keys available anywhere to write new data out. This meant that sometimes people would hit those in DNS, get a timeout, and have to retry.
* We couldn't reach our offsite backups at RDU2. Currently we access them via a VPN that runs from PHX2. We could of course have bugged someone to get us access, but they were all busy with the outage.
* geoip was down, which might have affected installs. We may be able to move this to a container.
* sudo anywhere depends on FAS being up, but it was not, so the only way we could get root on VMs was to go to the console and log in directly as root.

We also did a bit of brainstorming on IRC about what we could do if an outage like this happened again and lasted longer. Without a bunch of new hardware somewhere, the buildsystem side of things wouldn't be very practical to bring up anywhere else, but we could bring some non-build-related services back up in other datacenters, or at least improve the errors they give.

So, IMHO, some action items we might consider (rough sketches of these follow below):

* Create a batcave02.rdu that has all our repos, ansible keys and DNS keys. We can then use this to push things out if needed. It shouldn't be too hard to pull from batcave01 or otherwise keep things in sync.
* Consider having a tested set of configs for a long outage like this: haproxy/apache could be adjusted to show an outage page for all the down services, and we could bring things up more gradually.
* Consider an ipa03.rdu FreeIPA replica. This would at least allow kerberos auth (but of course with most things that use it down, it doesn't matter too much).
* Accelerate plans to move away from pam_url for sudo to something that will work when FAS is down.
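For the batcave02.rdu item, a rough sketch of keeping it in sync (hostnames and paths here are placeholders, not our real layout) could be as simple as a cron'd rsync from batcave01:

    # on batcave01: push ansible repos, private data and dns keys to the
    # warm standby; run from cron. Paths and hostname are illustrative.
    rsync -aHAX --delete /srv/git/ batcave02.rdu:/srv/git/
    rsync -aHAX --delete /srv/private/ batcave02.rdu:/srv/private/
    rsync -aHAX --delete /var/named/keys/ batcave02.rdu:/var/named/keys/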
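For the tested outage configs, one approach (just a sketch; the backend name and file path are made up) is to keep a canned outage page around and point haproxy's errorfile at it. Once every server in a backend is marked down, haproxy serves that file instead of letting clients time out:

    # haproxy.cfg fragment -- the errorfile must be a complete raw HTTP
    # response (status line, headers, blank line, HTML body)
    backend src_fedoraproject_org
        errorfile 503 /etc/haproxy/errors/503-outage.http
        server src01 src01.phx2.fedoraproject.org:80 check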
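For ipa03.rdu, the rough shape (assuming a FreeIPA version with replica promotion; domain and realm are placeholders and exact flags vary by version) is to enroll the new host as a client and then promote it:

    # on ipa03.rdu, as root
    ipa-client-install --domain fedoraproject.org --realm FEDORAPROJECT.ORG
    ipa-replica-install --setup-ca   # CA clone is optional, but nice offsite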
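For the pam_url replacement, one candidate (again just a sketch, assuming we point sudo auth at FreeIPA via sssd/pam_sss) is sssd with credential caching, so sudo authentication keeps working from the local cache even when the auth servers are unreachable:

    # /etc/sssd/sssd.conf fragment -- domain name is illustrative
    [domain/fedoraproject.org]
    id_provider = ipa
    auth_provider = ipa
    # cache successful logins so pam_sss can auth offline during an outage
    cache_credentials = True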
kevin