Greetings.

As everyone likely knows, we had a long outage of our primary datacenter (PHX2) yesterday. It started around 14:45 UTC and lasted until around 06:45 UTC Saturday morning. The outage was networking/routing related, but we don't know any further details. I'm happy to share what we know and can share once investigations are completed. The purpose of this email, however, is to talk about what we might be able to easily adjust to handle things like this moving forward. Happily these sorts of outages are very rare, but there are a few things I think we can do to make them a bit easier on us.

First, let's talk about things that went well:

* The mirrorlist containers worked great, so mirrorlist impact was low (but not nothing, see below).
* Our distributed proxy system meant things like getfedora.org and other static websites were fine.
* Our mirror network + metalinks/mirrorlists meant that people could still get updates, etc. just fine.
* status.fedoraproject.org was just great and even survived slashdotting (well, that's not all it used to be :)

And some things could have been better:

* We couldn't take proxy01/10 (the two proxies at PHX2) out of DNS because we didn't have the DNSSEC keys available anywhere to write new data out. This meant that sometimes people would hit those in DNS, get a timeout, and have to retry.
* We couldn't reach our offsite backups at RDU2. Currently we access them via a VPN that runs from PHX2. We could of course have bugged someone to get us access, but they were all busy with the outage.
* geoip was down, which might have affected installs. We may be able to move this to a container.
* sudo anywhere depends on FAS being up, but it was not, so the only way we could get root on VMs was to go to the console and log in directly as root.

We also did a bit of brainstorming on IRC about what we could do if an outage like this happened again and lasted longer. Without a bunch of new hardware somewhere, the buildsystem side of things wouldn't be very practical to bring up anywhere else, but we could bring some non-build-related services back up in other datacenters, or at least improve the errors they give.

So, IMHO, some action items we might consider (rough sketches of these follow below):

* Create a batcave02.rdu that has all our repos, ansible keys and DNS keys. We can then use this to push things out if needed. It shouldn't be too hard to pull from batcave01 or otherwise keep things in sync.
* Consider having a tested set of configs for a long outage like this: haproxy/apache could be adjusted to show an outage page for all the down services, and we could bring things up more gradually.
* Consider an ipa03.rdu FreeIPA replica. This would at least allow kerberos auth (but of course with most things that use it down, it doesn't matter too much).
* Accelerate plans to move away from pam_url for sudo to something that will work when FAS is down.
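For the batcave02.rdu item, a rough sketch of keeping it in sync (hostnames and paths here are placeholders, not our real layout) could be as simple as a cron'd rsync from batcave01:

    # on batcave01: push ansible repos, private data and dns keys to the
    # warm standby; run from cron. Paths and hostname are illustrative.
    rsync -aHAX --delete /srv/git/ batcave02.rdu:/srv/git/
    rsync -aHAX --delete /srv/private/ batcave02.rdu:/srv/private/
    rsync -aHAX --delete /var/named/keys/ batcave02.rdu:/var/named/keys/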
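For the tested outage configs, one approach (just a sketch; the backend name and file path are made up) is to keep a canned outage page around and point haproxy's errorfile at it. Once every server in a backend is marked down, haproxy serves that file instead of letting clients time out:

    # haproxy.cfg fragment -- the errorfile must be a complete raw HTTP
    # response (status line, headers, blank line, HTML body)
    backend src_fedoraproject_org
        errorfile 503 /etc/haproxy/errors/503-outage.http
        server src01 src01.phx2.fedoraproject.org:80 check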
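For ipa03.rdu, the rough shape (assuming a FreeIPA version with replica promotion; domain and realm are placeholders and exact flags vary by version) is to enroll the new host as a client and then promote it:

    # on ipa03.rdu, as root
    ipa-client-install --domain fedoraproject.org --realm FEDORAPROJECT.ORG
    ipa-replica-install --setup-ca   # CA clone is optional, but nice offsite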
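For the pam_url replacement, one candidate (again just a sketch, assuming we point sudo auth at FreeIPA via sssd/pam_sss) is sssd with credential caching, so sudo authentication keeps working from the local cache even when the auth servers are unreachable:

    # /etc/sssd/sssd.conf fragment -- domain name is illustrative
    [domain/fedoraproject.org]
    id_provider = ipa
    auth_provider = ipa
    # cache successful logins so pam_sss can auth offline during an outage
    cache_credentials = True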
kevin