Re: monitoring rebuild

On Thu, 7 Feb 2013 17:40:30 -0500 (EST)
R P Herrold <herrold@xxxxxxxxxxxx> wrote:

> Hard to prevent fat fingers -- some tools exist to emit
> well-formed files, and there is a 'preflighting' capability in
> Nagios that might be added before applying a VCS commit

Yep. We just need to hook it up... 
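
Something like this as a git pre-commit hook is roughly what I have in
mind -- just a sketch, and the config path is whatever our repo layout
ends up being:

    #!/usr/bin/env python
    # Hypothetical pre-commit hook: refuse the commit if the Nagios
    # config in the working tree doesn't pass the built-in syntax check.
    import subprocess
    import sys

    CONFIG = "nagios/nagios.cfg"   # path inside the config repo -- adjust

    # 'nagios -v <file>' parses and verifies the config without starting
    # the daemon; a nonzero exit means it found a problem.
    if subprocess.call(["nagios", "-v", CONFIG]) != 0:
        sys.stderr.write("Nagios preflight failed; not committing.\n")
        sys.exit(1)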

> > - nagios alerts us too much for things that don't matter or that we
> >  cannot do anything about. (isolated network issues, servers not
> >  responding in cases where there are no impacts, alerting then
> >  recovering very quickly)
> 
> That sounds like simple tuning of alert thresholds and may
> be done iteratively as a given false positive becomes overly
> annoying

Yes. 
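
For the flappy ones it's mostly a matter of bumping the retry and
notification settings on the service so it has to stay broken for a
while before anyone gets paged. Rough example (Nagios 3 directive
names; the template and values are made up):

    define service {
        use                       defaulttemplate   ; made-up template
        host_name                 foobar.vpn.fedoraproject.org
        service_description       http
        check_command             check_http
        max_check_attempts        5    ; recheck 5 times before going hard
        check_interval            5    ; minutes between normal checks
        retry_interval            1    ; minutes between rechecks
        first_notification_delay  15   ; sit on it 15 min before paging
        notification_options      w,c,r
    }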

> > - our dependencies are not right, so we get 50 pages for an issue
> > that is a single network or gateway link being down.
> 
> That is solved within the four corners of Nagios, which permits
> maintaining 'maps' of what is 'behind' a given node.  Nagios
> does not look, nor complain, further down an impaired path,
> once configured

Yep. This is complicated however by the following: 

We use nrpe to monitor whether hosts are alive.
That nrpe connection runs over our vpn.

foobar.vpn.fedoraproject.org could be down because: 
a) The machine really is down. 
b) The host it's running on is down, but the vpn doesn't have anything
to monitor on that host, so you can't add a dep for
'foobars-host.fedoraproject.org' easily/transparently. 
c) The network to that site is down. 

I think this is fixable, but requires some careful rethinking. 
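
Roughly the direction I'm thinking (host names below are made up):
give each remote site a gateway host object and hang the vpn hosts off
it with 'parents', plus an explicit hostdependency, so a dead link
shows up as one DOWN gateway with the rest marked UNREACHABLE (which
we can choose not to page on) instead of 50 pages:

    define host {
        use        defaulttemplate                  ; made-up template
        host_name  gateway.examplesite.fedoraproject.org
        address    192.0.2.1                        ; placeholder address
    }

    define host {
        use        defaulttemplate
        host_name  foobar.vpn.fedoraproject.org
        address    foobar.vpn.fedoraproject.org
        parents    gateway.examplesite.fedoraproject.org
    }

    ; belt and braces: no notifications for the vpn host while the
    ; gateway it sits behind is down or unreachable
    define hostdependency {
        host_name                      gateway.examplesite.fedoraproject.org
        dependent_host_name            foobar.vpn.fedoraproject.org
        notification_failure_criteria  d,u
    }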

> > - nagios monitors remote machines over vpn links, but also over
> > non-vpn links, leading to confusion over what a machine is
> > named in nagios.
> 
> Using FQ hostnames, and having a 'dummy' TLD of '.VPN' comes 
> to mind

That doesn't help. See above. ;) 
We do have a vpn subdomain. 
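
One thing that might cut the confusion down is picking a single
canonical host_name per machine and letting 'address' carry whichever
link the checks actually go over, e.g. (names made up):

    define host {
        use        defaulttemplate                 ; made-up template
        host_name  foobar                          ; one name in nagios
        alias      foobar (checked over the vpn)
        address    foobar.vpn.fedoraproject.org    ; the link checks use
    }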

> > - Our current setup isn't flexible enough to allow monitoring of
> >  non-core resources, but we want to do this. (for example, qa machines
> >  or secondary arch machines or cloud machines, where we may want
> >  different groups to get alerted or manage those machines).
> 
> Nagios has the concept of notification sub-groups for clutches 
> of machines

Yep. Again, it's a matter of setting it up and making it easy to
manage. 
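
The plumbing is all there -- something like the below (group, contact
and host names invented) would route qa box alerts to a qa contact
group instead of the main sysadmin one:

    define contactgroup {
        contactgroup_name  qa-admins              ; invented group name
        alias              QA machine admins
        members            qauser1, qauser2       ; invented contacts
    }

    define hostgroup {
        hostgroup_name  qa-machines
        members         qa01.qa.fedoraproject.org, qa02.qa.fedoraproject.org
    }

    define service {
        use                  defaulttemplate      ; made-up template
        hostgroup_name       qa-machines
        service_description  ssh
        check_command        check_ssh
        contact_groups       qa-admins
    }
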
> 
> > - When we add new machines or services we often forget to add
> >  monitoring, or we don't monitor all the things we should be.
> 
> That is more on the matter of a problem in the accession
> and de-cession checklists, no?

Yes, but it should also be automated as much as possible. People
sometimes miss things on checklists; if monitoring were set up
automatically when a machine was added, there would be fewer errors.
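
Strawman for the "set up automatically" part: a tiny generator that
spits out host definitions from whatever inventory list we already
maintain (the inventory format here is invented), so adding a machine
to the inventory is what adds it to nagios:

    #!/usr/bin/env python
    # Hypothetical: read one hostname per line from an inventory file
    # and emit a nagios host object for each, to drop into conf.d/.
    import sys

    TEMPLATE = """define host {
        use        defaulttemplate   ; made-up template
        host_name  %(name)s
        address    %(name)s
    }
    """

    def main(inventory="inventory.txt"):
        with open(inventory) as f:
            for line in f:
                name = line.strip()
                if not name or name.startswith("#"):
                    continue
                sys.stdout.write(TEMPLATE % {"name": name})

    if __name__ == "__main__":
        main(*sys.argv[1:])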

> > - When we do work on machines we sometimes forget to silence nagios.
> 
> <hal> 'This kind of thing has happened before, and it has 
> always been ... human error' </hal voice>

Indeed. :) 
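
Even a dumb wrapper that schedules downtime through the external
command file would beat remembering. Sketch below; the command file
path is a guess, adjust for wherever our install puts it. A companion
SCHEDULE_HOST_SVC_DOWNTIME line would cover the host's services too.

    #!/usr/bin/env python
    # Hypothetical helper: schedule 2 hours of fixed downtime for a
    # host (notifications are suppressed while downtime is active) via
    # nagios' external command file.
    import sys
    import time

    CMD_FILE = "/var/spool/nagios/cmd/nagios.cmd"   # path is a guess

    def schedule_downtime(host, hours=2, author="kevin", comment="planned work"):
        now = int(time.time())
        end = now + hours * 3600
        # SCHEDULE_HOST_DOWNTIME;host;start;end;fixed;trigger_id;duration;author;comment
        line = "[%d] SCHEDULE_HOST_DOWNTIME;%s;%d;%d;1;0;%d;%s;%s\n" % (
            now, host, now, end, hours * 3600, author, comment)
        with open(CMD_FILE, "w") as f:
            f.write(line)

    if __name__ == "__main__":
        schedule_downtime(sys.argv[1])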
 
> ... snip here, mention of yet more technological tools ...
> 
> Nagios may or may not be the right solution, but it is not 
> very loady and works better than at least 80 pct of the 
> alternatives I've tried, I'd say -- we run (locally and 
> externally) Zabbix, bugzilla, smokeping, cacti, OpenNMS, a 
> local wiki from tarballs, a local mediawiki, local custom SNMP 
> trapping, custom DB backed outage tracking and remediation -- 
> and each has faults or things I wish they did differently. 
> At least Nagios is reasonably well extensible

Yeah, we tried Zabbix a while back and it never made it to production.
Big issues: it was only manageable via the web interface, it couldn't
handle the load very well, and it was difficult to add new checks to a
wide pile of hosts.

> I have a couple of bugs open on some of those tools in EPEL
> and they have not resulted in any apparent change, so it may
> simply be that the talent is too thin, and I do not have a
> solution for that

Indeed. Could well be. 

> Most of what you outlined as problems is either a usage or
> training issue, or a systemic sysadmin issue -- ladling on
> puppet or cfengine, or ansible, or (name the new and shiny
> tool of the day) -- that is, a technology solution -- won't
> solve people issues (not reading, not testing, forcing commits
> of broken setups, and room for design and doco improvements)

Yes, I didn't say this would be solved by magic tools. I just noted
that we needed to revamp/clean up our config and do what we could to
automate things at the same time. 

Thanks for the input. 

kevin


_______________________________________________
infrastructure mailing list
infrastructure@xxxxxxxxxxxxxxxxxxxxxxx
https://admin.fedoraproject.org/mailman/listinfo/infrastructure
