Greetings.

I just made a Nagios change that causes it to send the very first alert for something to _just_ IRC. If you are active and looking at a problem at this point, please go and ack it on the web interface. This will stop escalations.

It will then wait 10 minutes, and the next alert (if the problem hasn't recovered or been acked) will go to IRC, email, and pagers. After that it will send every hour to IRC, email, and pagers until the problem is acked or solved.

Rationale:

* Much of the time now we have someone on IRC who can look at and fix issues (since we have sysadmin-main folks in Europe). Paging everyone is causing pager fatigue, especially when someone else is already fixing it.

* We get a lot of alerts for short, network-caused blips that recover in a few minutes. There's usually nothing we can do about them, our users never notice them, and paging on them and then immediately paging OK just bothers people and adds to pager fatigue. Ideally we would adjust these checks, and we should, but it's going to take a while to get them all right.

* We often get a lot of alerts from one proxy or the like being rebooted or restarting Apache. These usually only last a minute or two, and there's no need to page on them.

* We sometimes get alerts directly related to changes we are currently making in something, and then go fix them. There's no need to page someone for this; just keep an eye on IRC when making playbook or host changes and clean up anything you cause to alert.

I'd like to get back to the idea that if you get a page, it's an important thing you need to go look at, not "oh, Nagios again".

This is all subject to adjustment, but hopefully it will make life a bit easier for us sysadmin types and not cause any problems for anyone else. ;)

kevin
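
For reference, the notification timing described above can be sketched roughly as follows. This is only an illustration in Python, not the actual Nagios configuration: the function and constant names are made up for the example, and it assumes the hourly repeats are counted from the second alert (the one at 10 minutes).

```python
# Illustrative only: the names below are hypothetical, not Nagios settings.
FIRST_ALERT_CHANNELS = ("irc",)
ESCALATED_CHANNELS = ("irc", "email", "pager")
ESCALATION_DELAY_MIN = 10   # wait after the first alert
REPEAT_INTERVAL_MIN = 60    # assumed to be counted from the second alert


def notification_schedule(problem_minutes, acked_at=None):
    """Return (minute, channels) pairs for a problem lasting problem_minutes.

    Minute 0 is the first alert. Later notifications fire only while the
    problem is still open and has not been acked.
    """
    end = problem_minutes if acked_at is None else min(problem_minutes, acked_at)
    events = [(0, FIRST_ALERT_CHANNELS)]  # first alert goes to IRC only
    minute = ESCALATION_DELAY_MIN
    while minute < end:
        events.append((minute, ESCALATED_CHANNELS))
        minute += REPEAT_INTERVAL_MIN
    return events
```

Under those assumptions, a problem that stays unacked for 135 minutes produces alerts at minutes 0, 10, 70, and 130, with only the first one limited to IRC. Acking at minute 5 leaves just the initial IRC alert.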
_______________________________________________
infrastructure mailing list
infrastructure@xxxxxxxxxxxxxxxxxxxxxxx
https://admin.fedoraproject.org/mailman/listinfo/infrastructure