On Thu, 7 Feb 2013, Kevin Fenzi wrote:
I'd like to look at revamping our nagios setup to be nicer.
Problems with our current setup:
- It's very easy to make syntax errors that you can only fix with
further commits (i.e., no sanity checking at all).
Hard to prevent fat fingers -- some tools exist to emit
well-formed files, and there is a 'preflighting' capability in
Nagios that might be run before applying a VCS commit
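
A minimal sketch of such a preflight gate, assuming the check is
wired into a commit hook written in Python. The 'preflighting'
capability is taken here to be `nagios -v <main config>`, which
verifies the configuration without starting the daemon and, as far
as I know, exits non-zero on errors. The function name, the binary
name and the `runner` parameter are illustrative assumptions, not
part of any existing setup.

```python
import subprocess
import sys


def preflight(config_path, nagios_bin="nagios", runner=subprocess.run):
    """Return True if Nagios reports the configuration as valid.

    Runs `nagios -v <config_path>`, which only verifies the config.
    `runner` is injectable so the gate can be exercised without Nagios.
    """
    result = runner(
        [nagios_bin, "-v", config_path],
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        universal_newlines=True,
    )
    if result.returncode != 0:
        # Pass Nagios's own error report on to the committer.
        sys.stderr.write(result.stdout or "")
    return result.returncode == 0
```

A hook would call `preflight()` on a checkout of the proposed tree
(for example with a path such as /etc/nagios/nagios.cfg, which is
an assumed location) and reject the commit when it returns False.
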
- nagios alerts us too much for things that don't matter or that we
cannot do anything about. (isolated network issues, servers not
responding in cases where there are no impacts, alerting then
recovering very quickly)
That sounds like simple tuning of alert thresholds, which may
be done iteratively as a given false positive becomes overly
annoying
- our dependencies are not right, so we get 50 pages for an issue that
is a single network or gateway link being down.
That is solved within the four corners of Nagios, which permits
maintaining 'maps' of what is 'behind' a given node. Once
configured, Nagios neither looks nor complains further down an
impaired path
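
To illustrate the idea, here is a toy model of the reachability
rule (not Nagios code). It assumes a 'map' from each host to its
parents, similar in spirit to the `parents` directive in a Nagios
host definition. A failing host whose parents are all failing is
treated as unreachable rather than down, so only the root cause
needs to page anyone. The host names are hypothetical.

```python
def classify(parents, failing):
    """Split failing hosts into root causes and unreachable hosts.

    parents: dict mapping host name -> list of parent host names
    failing: set of host names whose checks currently fail
    """
    down, unreachable = set(), set()
    for host in failing:
        ups = parents.get(host, [])
        if ups and all(p in failing for p in ups):
            # Every path to this host is broken: hidden behind an outage.
            unreachable.add(host)
        else:
            # No parents, or at least one parent is fine: a real problem.
            down.add(host)
    return down, unreachable


# Hypothetical topology: two web hosts behind one gateway.
topology = {"gw": [], "web1": ["gw"], "web2": ["gw"]}
down, unreachable = classify(topology, {"gw", "web1", "web2"})
```

With the gateway failing, only `gw` lands in `down`. The two web
hosts are classified as unreachable, so one page goes out instead
of three.
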
- nagios monitors remote machines over vpn links, but also over non vpn
links, leading to us having confusion over what a machine is named in
nagios.
Using fully qualified hostnames, with a 'dummy' TLD of '.VPN'
for the VPN side, comes to mind
- Our current setup isn't flexible enough to allow monitoring of
non-core resources, but we want to do this. (for example, qa machines
or secondary arch machines or cloud machines, where we may want
different groups to get alerted or manage those machines).
Nagios has the concept of notification sub-groups for clutches
of machines
- When we add new machines or services we often forget to add
monitoring, or we don't monitor all the things we should be.
That is more a matter of a gap in the accession and
de-accession checklists, no?
- When we do work on machines we sometimes forget to silence nagios.
<hal> 'This kind of thing has happened before, and it has
always been ... human error' </hal voice>
... snip here, mention of yet more technological tools ...
Nagios may or may not be the right solution, but it does not
generate much load and works better than at least 80 percent of
the alternatives I've tried, I'd say -- we run (locally and
externally) Zabbix, Bugzilla, SmokePing, Cacti, OpenNMS, a
local wiki from tarballs, a local MediaWiki, local custom SNMP
trapping, custom DB-backed outage tracking and remediation --
and each has faults or things I wish it did differently.
At least Nagios is reasonably extensible
I have a couple of bugs open on some of those tools in EPEL
and they have not resulted in any apparent change, so it may
simply be that the talent is too thin, and I do not have a
solution for that
Most of what you outlined as problems is either a usage or
training issue, or a systemic sysadmin issue -- ladling on
Puppet, CFEngine, Ansible, or (name the new and shiny tool of
the day) -- a technology solution -- won't solve people issues
(not reading, not testing, forcing commits of broken setups),
nor will it supply the missing design and documentation
improvements
-- Russ herrold