Summary of Event
================

Tonight there was an unplanned outage of two proxy servers (proxy01 and proxy02). The proxies were unresponsive and had to be rebooted to come back online. Proxy01 being down caused a cascade of other issues that should have had very little end-user impact. As far as we know, the applications on admin.fp.o would have been up but appeared very slow, and the wiki would have been up for reading but logging in would have failed. Explanation to follow.

Proxy01 is the only proxy server used by the app servers (web apps, cronjobs, etc.) in phx2 that need to talk to our web applications in phx2. This was set up because the router that handles traffic into and out of phx2 does not allow us to "hairpin", that is, to send a request from phx2 to an external IP address that then resolves back to a server in phx2. As currently implemented, we have an /etc/hosts entry that points admin.fedoraproject.org at proxy01's internal IP address in phx2.

When proxy01 went down, things in phx2 that needed to talk to admin.fedoraproject.org were no longer able to get the data they needed. For the wiki, this meant that login attempts during the outage failed because the password could not be verified against FAS. For the TurboGears apps on admin.fedoraproject.org the situation was worse. TG1 apps' identity management depends on visit tracking to work, and visit tracking hits FAS for every request. This means that no page could be served for the TG1 apps from the phx2 app servers.

We have two app servers that reside outside of phx2. Because of network latency between these servers and the database server in phx2, they are configured as backups for the servers in phx2 and do not handle requests unless the phx2 servers are unable to.
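
To illustrate the primary/backup arrangement described above, here is a minimal Python sketch. It is not our actual haproxy configuration: the server names and the `eligible_servers` helper are hypothetical, chosen only for illustration. The `all_backups` flag models haproxy's documented behavior that, by default, only the first available backup server is used unless `option allbackups` is set. Whether that default applies here depends on the real configuration, which this report does not show.

```python
from dataclasses import dataclass


@dataclass
class Server:
    name: str             # hypothetical server name, for illustration only
    up: bool              # result of the most recent health check
    backup: bool = False  # True for servers marked as backups


def eligible_servers(servers, all_backups=False):
    """Return the servers that should receive traffic.

    Primaries that are up always win. Backups are considered only when
    no primary is up. With all_backups=False only the first backup that
    is up is used; with all_backups=True every backup that is up is used.
    """
    primaries = [s for s in servers if s.up and not s.backup]
    if primaries:
        return primaries
    backups = [s for s in servers if s.up and s.backup]
    return backups if all_backups else backups[:1]


pool = [
    Server("phx2-app-a", up=True),
    Server("phx2-app-b", up=True),
    Server("remote-app-a", up=True, backup=True),
    Server("remote-app-b", up=True, backup=True),
]

# Normal operation: only the phx2 servers take requests.
print([s.name for s in eligible_servers(pool)])

# Simulate the phx2 servers failing their health checks.
for server in pool[:2]:
    server.up = False

# Backups take over; how many depends on the all_backups setting.
print([s.name for s in eligible_servers(pool)])
print([s.name for s in eligible_servers(pool, all_backups=True)])
```

In this model the remote servers stay idle while any phx2 server is healthy, which keeps normal traffic away from the high-latency path to the database.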

The remaining proxy servers detected that the app servers within phx2 were down and properly switched over to the app servers outside of phx2, so there was no apparent outage for people trying to use admin.fedoraproject.org, although responses would have been drastically slower.

Looking at the haproxy status page for proxy03 during the outage, we noticed that only one of the two app servers outside of phx2 (app05 at ibiblio) was handling traffic. app06 (at telia) was not. We are not sure why this is. One possibility is that telia's network latency is just too high, so haproxy decided that app06 was also down and did not pass traffic to it.

Action Items
============

There are some open questions to try to resolve:

* Why did proxy01 and proxy02 die? A brief look at the logs has not revealed a cause for this.
* Why didn't app06 take up any of the slack when haproxy started passing traffic to the backups?

We have identified one means of mitigating this in the future: if we ran internal DNS for phx2, then we could have admin.fedoraproject.org resolve to different proxy servers (using internal IP addresses for the proxies inside of phx2). This should remove the SPOF on proxy01. We have not yet determined whether we'd need to run more proxy servers inside of phx2 or whether hairpinning would not be an issue if we used proxy servers outside of phx2.

-Toshio