Nagios event handlers

In an effort to further hide the fas issues we've been running into, I've
added an event handler to the app servers.  Briefly, the problem is that
when fas hangs, httpd processes on the app servers stack up, and once they
do, the servers become unresponsive.

Until now, nagios did this on failure:

Failed check 1: nothing (Soft)
Failed check 2: nothing (Soft)
Failed check 3: send notification (Hard)

Once it hits that hard state, nagios claims it's dead.  We get paged, and
the alert shows up in #fedora-noc.  Doom.

Now what it does is this:

Failed check 1: nothing (Soft)
Failed check 2: send notification to #fedora-noc, issue a "service httpd
      reload"
Failed check 3: send paged / emailed notifications, issue a "service httpd
      restart"
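
For anyone who wants to picture the logic, here is a rough sketch of what
such an event handler could look like.  It is not the actual script on the
app servers.  It assumes nagios passes the $SERVICESTATE$,
$SERVICESTATETYPE$ and $SERVICEATTEMPT$ macros as arguments, that a failure
means CRITICAL, and that the handler runs on the host where httpd lives.
The function name choose_action is made up for illustration, and the
#fedora-noc notification is left out.

```python
#!/usr/bin/env python
"""Illustrative Nagios event handler sketch (not the deployed script)."""
import subprocess
import sys


def choose_action(state, state_type, attempt):
    """Return the command to run for this check result, or None.

    Assumption: only CRITICAL results count as failures, and the
    service goes HARD on the third failed check.
    """
    if state != "CRITICAL":
        return None
    if state_type == "SOFT" and attempt == 2:
        # Second failed check: try the gentle fix first.
        return ["service", "httpd", "reload"]
    if state_type == "HARD":
        # Hard failure: restart httpd.
        return ["service", "httpd", "restart"]
    # First failed check, or anything else: do nothing.
    return None


def main(argv):
    # Expected arguments: state, state type, attempt number.
    state, state_type, attempt = argv[1], argv[2], int(argv[3])
    command = choose_action(state, state_type, attempt)
    if command is not None:
        subprocess.call(command)
    return 0


if __name__ == "__main__" and len(sys.argv) == 4:
    sys.exit(main(sys.argv))
```

Nagios would call this with something like "CRITICAL SOFT 2" as arguments.
Keeping the decision in a small function that only returns a command makes
it easy to check the reload / restart mapping without touching httpd.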


This is a significant change from how things worked before, and as such we
should track it closely.  The reason for the notification to #fedora-noc
is to ensure things aren't auto-correcting without us knowing, while at the
same time not generating a lot of unneeded email / paged alerts.
I'm going to let this run for a while, and let's see how it goes.

pkgdb, for whatever reason, has always been an excellent canary, which is
why it's the service I'm checking.

Questions / comments?

	-Mike
_______________________________________________
infrastructure mailing list
infrastructure@xxxxxxxxxxxxxxxxxxxxxxx
https://admin.fedoraproject.org/mailman/listinfo/infrastructure