On Thu, 7 Feb 2013 17:40:30 -0500 (EST)
R P Herrold <herrold@xxxxxxxxxxxx> wrote:

> Hard to prevent fat fingers -- some tools exist to emit well
> formed files, and there is a 'preflighting' capability in
> nagios that might be added before applying a VCS commit

Yep. We just need to hook it up... something like the pre-commit hook
sketched below, perhaps.
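(A minimal sketch of that hook-up, assuming the nagios config lives in
a git checkout with nagios.cfg at the top and only relative
cfg_file/cfg_dir includes -- untested, paths invented:)

    #!/bin/sh
    # .git/hooks/pre-commit -- refuse the commit when the config in
    # this checkout fails nagios's own preflight validation.
    nagios -v nagios.cfg || {
        echo "nagios -v reported errors; commit refused" >&2
        exit 1
    }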
> > - nagios alerts us too much for things that don't matter or that
> >   we cannot do anything about. (isolated network issues, servers
> >   not responding in cases where there are no impacts, alerting
> >   then recovering very quickly)
>
> That sounds like simple tuning of alert thresholds and may
> be done iteratively as a given false positive becomes overly
> annoying

Yes. (Sketch 1 at the end of this mail is the sort of tuning I mean.)

> > - our dependencies are not right, so we get 50 pages for an issue
> >   that is a single network or gateway link being down.
>
> That is solved within the four corners of Nagios, which permits
> maintaining 'maps' of what is 'behind' a given node. Nagios
> does not look, nor complain, further down an impaired path,
> once configured

Yep. This is complicated, however, by the following: we use nrpe to
monitor/show hosts alive. This nrpe connection is over our vpn.
foobar.vpn.fedoraproject.org could be down because:

a) The machine really is down.
b) The host it's running on is down, but the vpn doesn't have
   anything to monitor on that host, so you can't add a dep for
   'foobars-host.fedoraproject.org' easily/transparently.
c) The network to that site is down.

I think this is fixable, but requires some careful rethinking.
(Sketch 2 below is one way to express the parent mapping.)

> > - nagios monitors remote machines over vpn links, but also over
> >   non vpn links, leading to us having confusion over what a
> >   machine is named in nagios.
>
> Using FQ hostnames, and having a 'dummy' TLD of '.VPN' comes
> to mind

That doesn't help. See above. ;) We do have a vpn subdomain.

> > - Our current setup isn't flexible enough to allow monitoring of
> >   non core resources, but we want to do this. (for example, qa
> >   machines or secondary arch machines or cloud machines, where we
> >   may want different groups to get alerted or manage those
> >   machines).
>
> Nagios has the concept of notification sub-groups for clutches
> of machines

Yep. Again, it's a matter of setting it up and making it easy to
manage. (Sketch 3 below.)

> > - When we add new machines or services we often forget to add
> >   monitoring, or we don't monitor all the things we should be.
>
> That is more a matter of a problem in the accession
> and de-cession checklists, no?

Yes, but it also should get automated as much as possible, as people
sometimes miss things on checklists; if monitoring was automatically
set up when a machine was added, there would be fewer errors.

> > - When we do work on machines we sometimes forget to silence
> >   nagios.
>
> <hal> 'This kind of thing has happened before, and it has
> always been ... human error' </hal voice>

Indeed. :) (Sketch 4 below would at least make the silencing
scriptable.)

> ... snip here, mention of yet more technological tools ...
>
> Nagios may or may not be the right solution, but it is not
> very loady and works better than at least 80 pct of the
> alternatives I've tried, I'd say -- we run (locally and
> externally) Zabbix, bugzilla, smokeping, cacti, OpenNMS, a
> local wiki from tarballs, a local mediawiki, local custom SNMP
> trapping, custom DB backed outage tracking and remediation --
> and each has faults or things I wish they did differently.
> At least Nagios is reasonably well extensible

Yeah, we tried zabbix a while back and it never made it to production.
Big issues: it was only manageable via the web interface, it couldn't
handle the load very well, and it was difficult to add new checks to a
wide pile of hosts.

> I have a couple of bugs open on some of those tools in EPEL
> and they have not resulted in any apparent change, so it may
> simply be that the talent is too thin, and I do not have a
> solution for that

Indeed. Could well be.

> Most of what you outlined as problems is either a usage or
> training issue, or a systematic sysadmin issue -- ladling
> puppet or cfengine, or ansible, or (name the new and shiny
> tool of the day) -- a technology solution -- won't solve
> people issues (not reading, not testing, forcing commits of
> broken setups, and room for design and doco improvements)

Yes, I didn't say this would be solved by magic tools. I just noted
that we needed to revamp/clean up our config and do what we could to
automate things at the same time.
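Sketch 1, threshold tuning -- make a check fail repeatedly before it
goes to a HARD state and pages anyone, and don't re-page constantly.
All names here are invented; untested:

    define service {
        use                    generic-service  ; assumed local template
        host_name              foobar.vpn.fedoraproject.org
        service_description    ssh
        check_command          check_ssh
        max_check_attempts     5    ; must fail 5 checks in a row
        check_interval         5    ; minutes between normal checks
        retry_interval         1    ; minutes between retries
        notification_interval  120  ; re-notify every 2h, not every check
    }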
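Sketch 2, parent mapping -- give each remote host a 'parents' entry
for the gateway/link it sits behind, so a dead link is one page for
the gateway instead of 50 for everything behind it. Hostnames and
addresses are invented; untested:

    define host {
        use            generic-host  ; assumed local template
        host_name      gw.somesite.fedoraproject.org
        address        203.0.113.1
        check_command  check-host-alive
    }

    define host {
        use            generic-host
        host_name      foobar.vpn.fedoraproject.org
        address        192.168.100.10
        parents        gw.somesite.fedoraproject.org
        check_command  check-host-alive
    }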
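Sketch 3, notification sub-groups -- route alerts for non-core boxes
to the people who actually manage them. Group and member names are
invented; untested:

    define contactgroup {
        contactgroup_name  qa-admins
        alias              QA machine admins
        members            qacontact1, qacontact2
    }

    define hostgroup {
        hostgroup_name  qa-machines
        alias           QA machines
        members         qa01.fedoraproject.org, qa02.fedoraproject.org
    }

    define service {
        use                  generic-service  ; assumed local template
        hostgroup_name       qa-machines
        service_description  ssh
        check_command        check_ssh
        contact_groups       qa-admins
    }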
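Sketch 4, scriptable silencing -- push scheduled downtime in through
nagios's external command file so it can be folded into whatever
kicks off the maintenance work. The command file path is a guess
(check nagios.cfg for the real one); untested:

    #!/bin/sh
    # silence a host and all its services: 2 hours of fixed downtime
    host="foobar.vpn.fedoraproject.org"
    cmdfile="/var/spool/nagios/cmd/nagios.cmd"
    now=$(date +%s)
    end=$((now + 7200))

    printf '[%s] SCHEDULE_HOST_DOWNTIME;%s;%s;%s;1;0;7200;kevin;maint\n' \
        "$now" "$host" "$now" "$end" >> "$cmdfile"
    printf '[%s] SCHEDULE_HOST_SVC_DOWNTIME;%s;%s;%s;1;0;7200;kevin;maint\n' \
        "$now" "$host" "$now" "$end" >> "$cmdfile"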
Thanks for the input.

kevin