In an effort to further hide the FAS issues we've been running into, I've added an event handler to the app servers. A brief description of the problem: when FAS hangs, app server httpd processes stack up. When they do, they become unresponsive.

Previously, nagios did this on failure:

  Failed check 1: nothing (soft)
  Failed check 2: nothing (soft)
  Failed check 3: send notification (hard)

Once it hits that hard state, nagios claims it's dead. We get paged, the alert shows up in #fedora-noc. Doom.

Now what it does is this:

  Failed check 1: nothing (soft)
  Failed check 2: send notification to #fedora-noc, issue a "service httpd reload"
  Failed check 3: send paged / emailed notifications, issue a "service httpd restart"

This is a significant change from how things worked before, and as such we should track it closely. The reason for the notification to #fedora-noc is to ensure things aren't auto-correcting without us knowing, while at the same time we're not generating a lot of unneeded email / paged alerts.

I'm going to let this run for a while and let's see how it goes. pkgdb, for whatever reason, has always been an excellent canary, which is why I'm checking it.

Questions / comments?

-Mike
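
For anyone who wants to picture the escalation described above, here is a minimal sketch of what such a handler could look like. This is not the script that was deployed; it is an illustration only. It assumes nagios passes the standard $SERVICESTATE$, $SERVICESTATETYPE$ and $SERVICEATTEMPT$ macros as arguments, that the check allows 3 attempts, and that the script runs somewhere "service httpd" can reach the affected app server. The names choose_action and main are hypothetical, and the notifications themselves are left to nagios.

```python
#!/usr/bin/env python3
"""Hypothetical nagios event handler sketch for the httpd escalation."""
import subprocess
import sys


def choose_action(state, state_type, attempt):
    """Map one nagios event to 'reload', 'restart', or None (do nothing)."""
    if state != "CRITICAL":
        # OK, WARNING and UNKNOWN events need no intervention in this sketch.
        return None
    if state_type == "SOFT" and attempt == 2:
        # Second failed check: try the gentle fix first.
        return "reload"
    if state_type == "HARD":
        # Final failed check: the reload did not help, so restart.
        return "restart"
    # First failed check: do nothing.
    return None


def main(argv):
    """Expects argv[1:4] to be state, state type and attempt number."""
    state, state_type, attempt = argv[1], argv[2], int(argv[3])
    action = choose_action(state, state_type, attempt)
    if action is not None:
        # Assumption: the init script is reachable as "service httpd <action>".
        subprocess.call(["service", "httpd", action])
    return 0


# Only act when invoked with the three expected arguments.
if __name__ == "__main__" and len(sys.argv) == 4:
    main(sys.argv)
```

Keeping the decision in a small pure function means the escalation table can be checked without touching httpd at all.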