On 5 October 2010 16:44, Mike McGrath <mmcgrath@xxxxxxxxxx> wrote:
> In an effort to further hide the fas issues we've been running into, I've
> added an event handler to the app servers. A brief description of the
> problem: when fas hangs, app server httpd processes stack up. When they
> do, they become unresponsive.
>
> Currently nagios does this on failure:
>
> Failed check 1: nothing (soft)
> Failed check 2: nothing (soft)
> Failed check 3: send notification (hard)
>
> Once it hits that hard state, nagios claims it's dead. We get paged, the
> alert shows up in #fedora-noc. Doom.
>
> Now what it does is this:
>
> Failed check 1: nothing (soft)
> Failed check 2: send notification to #fedora-noc, issue a service httpd
>                 reload
> Failed check 3: send paged / emailed notifications, issue a service httpd
>                 restart
>
> This is a very different change from how things were and as such we should
> track this closely. The reason for the notification to #fedora-noc
> is to ensure things aren't auto-correcting without us knowing. But at the
> same time we're not generating a lot of unneeded email / paged alerts.
> I'm going to let this run for a while and let's see how it goes.
>
> pkgdb, for whatever reason, has always been an excellent canary, which is
> why I'm checking it.
>

Hi,

It looks OK to me, but do you have stats on how many times you get a
second failed check without reaching a third? I'm thinking of network
micro-outages, load peaks, or something funny on the server. Maybe it
needs to be a four-check service (reload at the third).

On the other hand, it's just a reload of httpd ;)

Regards,

-- 
"My name is Ozymandias, king of kings: Look on my works, ye Mighty, and despair!"
Percy Bysshe Shelley
http://sites.google.com/site/carlossepulveda
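
For reference, the escalation described in the quoted message, along with the four-check variant suggested in the reply, could be sketched roughly as below. This is only an illustration and not the handler actually deployed on the app servers: the function name `choose_action`, the `reload_at` parameter, and the assumption that only CRITICAL states trigger anything are all hypothetical. The three inputs are meant to mirror the Nagios macros `$SERVICESTATE$`, `$SERVICESTATETYPE$` and `$SERVICEATTEMPT$`.

```python
# Hypothetical sketch of the decision logic only; it runs no commands.
# Inputs mirror the Nagios macros $SERVICESTATE$, $SERVICESTATETYPE$
# and $SERVICEATTEMPT$ that an event handler can be passed.

def choose_action(state, state_type, attempt, reload_at=2):
    """Return the command to run for one failed-check event, or None.

    reload_at is the soft attempt on which httpd is reloaded
    (2 in the scheme described in the thread, 3 in the four-check variant).
    """
    # Assumption: only CRITICAL results trigger the handler.
    if state != "CRITICAL":
        return None
    # Soft failure at the chosen attempt: try the gentle fix first.
    if state_type == "SOFT" and attempt == reload_at:
        return ["service", "httpd", "reload"]
    # Hard state reached: the service is considered down, so restart.
    if state_type == "HARD":
        return ["service", "httpd", "restart"]
    # Any other soft failure: do nothing yet.
    return None


if __name__ == "__main__":
    # Three-check service as described: nothing, reload, restart.
    for attempt, state_type in [(1, "SOFT"), (2, "SOFT"), (3, "HARD")]:
        print(attempt, state_type, choose_action("CRITICAL", state_type, attempt))
```

With `reload_at=3` and a service configured for four check attempts, the same function gives nothing, nothing, reload, restart, which is the variant proposed above. In a real setup the returned command would be executed by a script named in the service's `event_handler` directive.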