Greetings.

I'd like to look at revamping our nagios setup to be nicer.

Problems with our current setup:

- It's very easy to make syntax errors that you can only fix with further commits (i.e., there is no sanity checking at all).

- nagios alerts us too much for things that don't matter or that we cannot do anything about (isolated network issues, servers not responding in cases where there is no impact, alerting and then recovering very quickly).

- Our dependencies are not right, so we get 50 pages for an issue that is a single network or gateway link being down.

- nagios monitors remote machines over VPN links, but also over non-VPN links, which leaves us confused about what a machine is named in nagios.

- Our current setup isn't flexible enough to allow monitoring of non-core resources, but we want to do this (for example, qa machines, secondary arch machines or cloud machines, where we may want different groups to get alerted or to manage those machines).

- When we add new machines or services we often forget to add monitoring, or we don't monitor all the things we should.

- When we do work on machines we sometimes forget to silence nagios.

Requirements:

- Syntax check commits and reject them where they would break things.

- Only send pages for things that impact either end users or maintainers. Things that don't impact either of those should mail/notify, but not page.

- Correct dependencies, so that if one thing is down, everything that depends on it does not alert.

- Allow non-FI groups to have machines/networks they manage.

- When adding a new machine/service, monitoring should be automatically configured if at all possible.

- Possibly have some kind of escalation setup: page, wait X min for an ack, and if there is none, page again or page more people, etc.

- When doing work, have an easy way to silence alerts for the affected machines.

Some possible parts of the solution:

- check_mk - This would allow us to just install and query it, and it would monitor all the running processes, etc. It could very much simplify the config on normal nodes.

- ansible playbooks to easily disable notifications or alerts in cases where that's needed. We can use the web interface, but a simple script to do this would be nice to call from other playbooks. This needs consistent naming for hosts.

- A commit hook to check syntax. This will require nagios on lockbox01 and installing to a tree or something, unless there's a way we can pull the syntax checking out of nagios. Does the python-nagios package do it?

- Identify those items/hosts that should page, and assume the rest are email/notice only. The mass reboot SOP already has these in the Class A/B hosts listed there. I can easily generate a list.

I'm sure there are more things we can do here, and several people have looked into this problem in the past. It's not a simple one, but I think it's a good one to work on and fix.

I'm not sure if it makes sense to make a new nagios git repo (like we did for dns) or just move it over to the ansible repo. I kind of think it would be nicer just in the ansible repo.

These are just preliminary ideas on the scope and the problem... I'll look at starting on some more detailed work soon.

kevin
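As a rough illustration of the "simple script to silence alerts" idea above, here is a minimal Python sketch that builds a nagios SCHEDULE_HOST_DOWNTIME external command and writes it to the nagios command file. The function names, the example hostname and the command file path are assumptions made for illustration, not anything from our current setup; the host name passed in has to match what the machine is called in nagios, which is where the consistent naming comes in.

```python
import time


def host_downtime_command(host, minutes, author, comment, now=None):
    """Build one SCHEDULE_HOST_DOWNTIME external command line.

    'host' must be the name nagios knows the machine by.
    'now' is only there so the result can be checked with a fixed time.
    """
    start = int(time.time()) if now is None else int(now)
    duration = minutes * 60
    end = start + duration
    # fixed=1: downtime runs exactly from start to end.
    # trigger_id=0: not triggered by another downtime entry.
    return "[{0}] SCHEDULE_HOST_DOWNTIME;{1};{0};{2};1;0;{3};{4};{5}".format(
        start, host, end, duration, author, comment
    )


def submit(command_line, command_file):
    """Write the command to the nagios external command file.

    The location of that file depends on the install (command_file
    in nagios.cfg), so it is passed in rather than hardcoded.
    """
    with open(command_file, "w") as fh:
        fh.write(command_line + "\n")


if __name__ == "__main__":
    # Hypothetical host and reason, printed rather than submitted.
    print(host_downtime_command("db01.example.org", 30, "kevin", "mass reboot"))
```

A playbook could call something like this before touching a machine, with the command file path taken from the nagios server's config. Service checks on the host would need the matching service downtime command as well; this sketch only covers the host itself.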