Hello Folks,

I wrote a monitoring script[1] for check_mk[2] which integrates
"puppetstatus" and a last-run check. It shows whether puppet is disabled
(and the reason for disabling it) and when the agent was last run.

It cannot check what puppet did or whether the catalog compiled
successfully; I'd +1 puppet-dashboard for that.

Speaking of enhancing things, I'd also like to suggest switching
local-host checks from NRPE to check_mk via SSH. There are certain
benefits:

• Automatic inventory of checks per host.
• For each host, Nagios triggers only one active check, which returns
  all the data needed to feed the other, passive checks.
• All other checks just use cached data from the active check.
• check_mk also extracts performance data and can insert it directly
  into round-robin databases.

Regards,
Stefan.

[1] - https://github.com/sts/checkmk/tree/master/puppet
[2] - http://mathias-kettner.de/check_mk.html

On Mar 18, 2011, at 17:28 , Toshio Kuratomi wrote:

> On Fri, Mar 18, 2011 at 11:04:32AM -0400, seth vidal wrote:
>> Hi folks,
>> some thoughts have been slowly coalescing in my head about how we're
>> managing our boxes/services and I have some suggestions I've passed by
>> various folks but I wanted to check them out with everyone:
>>
>>
>> 1. puppetd sucks..... memory. Right now we have puppetd running on every
>> box and it wakes up every half hour and runs itself. This is fine but in
>> the time where it is not doing anything it just eats memory for no good
>> reason. I'd like to suggest we move to a cron-driven model instead of
>> puppetd. I'd write a simple cron job that runs every half hour to run
>> puppetd, if a lock file is not found. Pretty straightforward, of
>> course.
>>
> +1
>
> Might need to update kickstarts and/or the SOP pages:
>
> http://fedoraproject.org/wiki/Kickstart_Infrastructure_SOP
> http://fedoraproject.org/wiki/Puppet_Infrastructure_SOP
>
>> 2. monitoring if puppetd has run properly:
>> two things we want to know about puppet runs:
>> a. when they last happened per-box
>> b. if they fell over in a horrible way.
>>
>> (a) can be known by looking at the $nodename.yaml file which lives
>> on the puppetmaster. I've written a script to check if that file is
>> older than 1 hour and report the nodename if it is.
>> (b) can be done via the cron job - ie: taking error output from the
>> puppet run and mailing to people until we fix it! :)
>>
> +1
>
>> 3. sign** boxes. problems here:
>> a. These boxes are falling out of date, repeatedly, b/c they aren't
>> in our normal updating path.
>> b. these boxes don't email out to the same locations as the other
>> boxes
>> c. these boxes don't get faspassword updates properly
>> d. these boxes don't get config changes normally via puppet
>>
>> (a) I'd like to suggest that they be put into a normal updating path
>> and/or we setup a nag mail to tell us about them
>> (b) obviously, fix their mail configs
>> (c) fasclient is failing b/c of a missing token b/c, most likely, of
>> (d)
>>
>> I'm open to suggestions on those but it is a bit annoying b/c while I
>> understand their 'sensitivity' I think our way of treating them is
>> making the problem WORSE not better.
>>
> I agree with your assessment. I guess we need to tell releng our concerns
> and figure out what needs to be done. For a: perhaps have releng okay us/a
> specific subset of sysadmins to run updates along with all the other
> updates.
>
> -Toshio
> _______________________________________________
> infrastructure mailing list
> infrastructure@xxxxxxxxxxxxxxxxxxxxxxx
> https://admin.fedoraproject.org/mailman/listinfo/infrastructure

--
Stefan Schlesinger ////////////////////////////////////////// /////// sts@xxxxxx
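
For reference, the check described in point 2(a) of the quoted message (report any node whose $nodename.yaml on the puppetmaster is older than one hour) could be sketched roughly as below. This is not the script mentioned in the thread, just a minimal illustration. The directory path, the function name stale_nodes and its parameters are assumptions made for the example; adjust them to wherever the puppetmaster actually keeps its per-node YAML files.

```python
import time
from pathlib import Path

# Assumption: the puppetmaster keeps one <nodename>.yaml per node here.
DEFAULT_YAML_DIR = "/var/lib/puppet/yaml/node"
MAX_AGE_SECONDS = 3600  # "older than 1 hour", as in the quoted message


def stale_nodes(yaml_dir, max_age=MAX_AGE_SECONDS, now=None):
    """Return the sorted node names whose YAML file is older than max_age seconds."""
    now = time.time() if now is None else now
    stale = []
    for path in Path(yaml_dir).glob("*.yaml"):
        # The modification time stands in for "time of the last puppet run".
        if now - path.stat().st_mtime > max_age:
            stale.append(path.stem)  # file name without the .yaml suffix
    return sorted(stale)


if __name__ == "__main__":
    # Print one stale node name per line, e.g. for a cron mail or a Nagios wrapper.
    for node in stale_nodes(DEFAULT_YAML_DIR):
        print(node)
```

Run from cron on the puppetmaster, any output would mean at least one node has not checked in within the last hour.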