monitoring rebuild

On Thu, 7 Feb 2013, Kevin Fenzi wrote:

> I'd like to look at revamping our nagios setup to be nicer.

> Problems with our current setup:
>
> - It's very easy to make syntax errors that you can only fix with
>   further commits. (i.e., no sanity checking at all).

Hard to prevent fat fingers entirely -- some tools exist to emit well-formed files, and Nagios ships a 'preflight' syntax check (nagios -v) that could be run before a VCS commit is applied.
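
For instance, a hook could refuse the change unless the shipped syntax checker passes; something along these lines, assuming the stock config path:

    #!/bin/sh
    # pre-commit sketch: reject the commit if nagios cannot parse the config
    if ! nagios -v /etc/nagios/nagios.cfg; then
        echo "nagios config failed the preflight check; commit refused" >&2
        exit 1
    fi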

> - nagios alerts us too much for things that don't matter or that we
>   cannot do anything about. (isolated network issues, servers not
>   responding in cases where there are no impacts, alerting then
>   recovering very quickly)

That sounds like simple tuning of alert thresholds, and it can be done iteratively as a given false positive becomes annoying enough to deal with.
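
As a sketch of the knobs involved (the host and service names are made up):

    define service {
        use                       generic-service
        host_name                 proxy01
        service_description       HTTP
        check_command             check_http
        max_check_attempts        5    ; re-check 5 times before entering a hard state
        retry_interval            2    ; minutes between those re-checks
        first_notification_delay  15   ; sit on a hard problem 15 minutes before paging
        flap_detection_enabled    1    ; damp the alert/recover/alert churn
    }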

> - our dependencies are not right, so we get 50 pages for an issue that
>   is a single network or gateway link being down.

That is solved within the four corners of Nagios, which permits maintaining 'maps' of what is 'behind' a given node. Once configured, Nagios neither looks nor complains further down an impaired path.
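
A minimal sketch with made-up names: everything behind a gateway declares it as a parent, and when the gateway itself goes down Nagios marks the hosts behind it UNREACHABLE rather than DOWN, and stays quiet about them:

    define host {
        use        generic-host
        host_name  gw-phx2              ; the gateway/router
        address    10.5.126.1
    }

    define host {
        use        generic-host
        host_name  db01.phx2.fedoraproject.org
        address    10.5.126.41
        parents    gw-phx2              ; reachable only through the gateway
    }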

> - nagios monitors remote machines over vpn links, but also over non vpn
>   links, leading to us having confusion over what a machine is named in
>   nagios.

Using fully qualified hostnames, and hanging a 'dummy' TLD of '.vpn' on the tunnel-side entries, comes to mind.
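
That is, define the same box twice under unambiguous names, one per path (the names and addresses below are invented):

    define host {
        use        generic-host
        host_name  db01.phx2.fedoraproject.org   ; checked over the plain network
        address    10.5.126.41
    }

    define host {
        use        generic-host
        host_name  db01.vpn                      ; the same box, checked over the vpn link
        address    192.168.100.41
    }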

> - Our current setup isn't flexible enough to allow monitoring of non
>   core resources, but we want to do this. (for example, qa machines
>   or secondary arch machines or cloud machines, where we may want
>   different groups to get alerted or manage those machines).

Nagios has the concept of notification sub-groups -- contact groups tied to host groups -- for exactly such clutches of machines.
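
A sketch, with invented group, host, and member names: one contactgroup per team, one hostgroup per clutch, and the hosts pointed at the right contacts:

    define contactgroup {
        contactgroup_name  qa-admins
        alias              QA machine owners
        members            jdoe, rbloggs
    }

    define hostgroup {
        hostgroup_name     qa-machines
        members            qa01, qa02
    }

    define host {
        use             generic-host
        host_name       qa01
        address         10.5.124.10
        contact_groups  qa-admins     ; only the QA team gets paged for this box
    }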

> - When we add new machines or services we often forget to add
>   monitoring, or we don't monitor all the things we should be.

That is more a problem with the accession and decommissioning checklists, no?

> - When we do work on machines we sometimes forget to silence nagios.

<hal voice> 'This kind of thing has happened before, and it has always been ... human error' </hal voice>
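
One way to take the human out of the loop is to bolt downtime scheduling onto the maintenance procedure itself, via Nagios's external command file (the command-file path, host name, and author below are assumptions):

    #!/bin/sh
    # schedule two hours of fixed downtime for a host before working on it
    now=$(date +%s)
    end=$((now + 7200))
    printf '[%s] SCHEDULE_HOST_DOWNTIME;db01;%s;%s;1;0;7200;rherrold;kernel update\n' \
        "$now" "$now" "$end" > /var/spool/nagios/cmd/nagios.cmd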

... snip here, mention of yet more technological tools ...

Nagios may or may not be the right solution, but it is not very heavy on resources, and it works better than at least 80 percent of the alternatives I've tried. We run (locally and externally) Zabbix, Bugzilla, smokeping, cacti, OpenNMS, a local wiki from tarballs, a local MediaWiki, local custom SNMP trapping, and custom DB-backed outage tracking and remediation -- and each has faults or things I wish it did differently. At least Nagios is reasonably extensible.
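
The plugin contract is about as small as these things get: one line of output plus an exit status. A trivial local check is a few minutes' work; a sketch (the daemon name is just a placeholder):

    #!/bin/sh
    # Minimal Nagios check plugin: the exit code is the whole API.
    # 0 = OK, 1 = WARNING, 2 = CRITICAL, 3 = UNKNOWN
    if pgrep -x mydaemon >/dev/null 2>&1; then
        echo "MYDAEMON OK: process is running"
        exit 0
    else
        echo "MYDAEMON CRITICAL: process is not running"
        exit 2
    fi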

I have a couple of bugs open on some of those tools in EPEL, and they have not resulted in any apparent change; it may simply be that the talent is spread too thin, and I do not have a solution for that.

Most of what you outlined as problems are either usage and training issues, or systematic sysadmin issues. Ladling on puppet or cfengine or ansible or (name the new and shiny tool of the day) -- a technology solution -- won't solve people issues: not reading, not testing, forcing commits of broken setups. There is room for design and documentation improvements, though.

-- Russ herrold
_______________________________________________
infrastructure mailing list
infrastructure@xxxxxxxxxxxxxxxxxxxxxxx
https://admin.fedoraproject.org/mailman/listinfo/infrastructure


