Re: Top 10 services/servers/etc

Gareth Marchant <gareth@xxxxxxxxxxxx> · Sat, 05 Mar 2011 20:51:28 -0500

On Sat, 2011-03-05 at 16:06 -0700, Stephen John Smoogen wrote:
> On Fri, Mar 4, 2011 at 18:31, Gareth Marchant <gareth@xxxxxxxxxxxx> wrote:
> > Kevin Fenzi <kevin@xxxxxxxxx> wrote:
> >>
> >> On Fri, 04 Mar 2011 19:07:53 -0500 Gareth Marchant <gareth@xxxxxxxxxxxx>
> >> wrote:
> >> > Does the nagios stage environment operate in an equivalent manner to
> >> > prod such that testing nagios 3 in stage for these systems would
> >> > accurately reflect prod? I assume that there are specific monitors
> >> > for each of these systems that would need to be exercised? I can only
> >> > imagine what that list will look like...
> >>
> >> https://admin.stg.fedoraproject.org/nagios/
> >>
> >> You can see that it can't reach/monitor a lot of the things that the real
> >> instance does. The stg env just doesn't have access to all the things it
> >> would need outside it.
> >>
> >> kevin
> >
> > How about devices? I am sure there are routers, switches, gateways,
> > firewalls and maybe storage hardware monitored by nagios that are high
> > priority/highly critical and worthy of test?
> 
> We don't control 99.999% of them and have no access to them beyond
> pinging them. In many ways our infrastructure is very much a "cloud".
> We have systems but everything else is outsourced :).
> 
> The storage hardware we can monitor is pretty much the EqualLogics that
> releng has. Everything else we get through closed, firewalled-off
> networks.
> 
> > How deeply should testing go or, put another way, how much go-live risk can
> > be tolerated? Should a gap analysis of stage environment to production be
> > performed prior to making a nagios test plan? I am not sure how rigorously
> > structured this upgrade plan should be!
> >
> 
> If gap analysis or other items are itches you like to scratch, we can
> work them into version 2 of the test plan(s). It would be a good
> training exercise for people to see how it's done (as I only know it
> from consultants who were not doing it right according to the next set
> of consultants). If they are things you would not touch with a 10-foot
> pole, I have no wish to make a volunteer spend time on them.
> 
> Our go-live risk tolerance is pretty high, as we have done upgrades
> with no test plan for 6-7 years now. The goal here is to start from
> something a bit more complex than "does the web page have errors? No?
> Then we are good," because we have grown to be more complex and end up
> with 4-8 hour periods of "well darn, I completely forgot that."
> 
> So I expect that after each upgrade we will have many lessons learned,
> say "we will add this to testing next time," and then be able to do so.
> I guess what I am saying is: let's do enough that it fits on an iPad
> web page the first time and make it more complex as we go.
> 
> My general philosophy for people volunteering time on Fedora is:
> Rule 1: Do good work for others as you would want them to do for you.
> Rule 2: Have Fun
> Rule 3: Keep true to Freedom, Friends, First, and Features without
> breaking 1 or 2.
> 
> So don't stress over the test plan if it misses a bunch of stuff. [I
> am saying this out loud because I usually get stressed over such stuff
> and have to remind myself :).] My main hope is to learn how to do our
> stuff better incrementally.
> 
> I hope this helps better outline what we need to start with. If a
> deadline would work better, I would like to have Nagios ready to go
> live by the first of April. What do we need in order to have noc01.stg
> tested by March 28th?
> 
> 

Perfect, this is exactly the philosophical viewpoint I was hoping to
get.  "Test plan" means different things to different people!
Fortunately the only itch I have to scratch is covered in "Rule 1." 

I will expand the basic plan I put together before. From what I am
hearing, expanding it just enough to cover the obvious stuff should be
sufficient, correct?

For example:
1. Test the nagios system itself. Exercise the nagios services to verify
clean start/stop/restart, bounce the server to verify that nagios comes
back online without intervention, and perhaps have several individuals
hit the nagios web interface while services restart to validate that
things operate as expected.
2. Turn down various services on various hosts and verify proper
notification. Start with one or two services and progress to turning
off larger numbers of services simultaneously.
3. Test the notification facilities. I am not sure exactly how mail
alerts are configured, but it might be worthwhile to break smtp
connectivity to validate secondary alert functions such as a fallback
smtp connection or text alerts.
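
As a rough sketch of how items 1 and 2 could be scripted (Python, purely
illustrative): it assumes nagios on the stage host is driven through a
SysV-style "service nagios <action>" command, and every function name
below is made up for this example. The command runner is passed in as a
parameter so the logic can be tried without touching a real nagios.

```python
import subprocess

# Assumption: nagios is controlled with "service nagios <action>" on the
# stage host. Adjust service_runner() if the host uses something else.
ACTIONS = ("stop", "start", "restart")


def service_runner(action):
    """Run 'service nagios <action>' and return its exit code."""
    return subprocess.call(["service", "nagios", action])


def exercise_service(run=service_runner, actions=ACTIONS):
    """Item 1: run each action in order, collecting (action, exit_code)."""
    return [(action, run(action)) for action in actions]


def failed_actions(results):
    """Return the actions whose exit code was non-zero."""
    return [action for action, code in results if code != 0]


def escalating_batches(checks, first=1):
    """Item 2: split checks into batches of doubling size (1, 2, 4, ...)."""
    batches, size, pos = [], max(1, first), 0
    while pos < len(checks):
        batches.append(checks[pos:pos + size])
        pos += size
        size *= 2
    return batches
```

Calling exercise_service() with no arguments would run the real
commands on whatever host it is started on. escalating_batches() only
decides the order in which services get turned off; the turning off
itself, and checking that the notifications arrive, is left to whoever
runs the test.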

I will pad this basic list with some actual tasks, and would be happy to
hear other people's input and suggestions for items 1, 2 and 3 above.
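
One such task for item 3 might be a small harness that mimics "primary
smtp is broken, does the fallback deliver?" from the outside. This is
only a sketch in Python: it does not reflect how the nagios
notification commands are actually configured (which I do not know),
and the function names are hypothetical. It uses only the standard
smtplib and email modules.

```python
import smtplib
from email.message import EmailMessage


def build_alert(subject, body, sender, recipient):
    """Build a plain-text test alert message."""
    msg = EmailMessage()
    msg["Subject"] = subject
    msg["From"] = sender
    msg["To"] = recipient
    msg.set_content(body)
    return msg


def smtp_sender(host, port=25, timeout=10):
    """Return a function that sends a message through one smtp relay."""
    def send(msg):
        with smtplib.SMTP(host, port, timeout=timeout) as conn:
            conn.send_message(msg)
    return send


def send_with_fallback(msg, senders):
    """Try each (name, send) pair in order.

    Returns (name_that_worked_or_None, list of (name, error) failures).
    """
    failures = []
    for name, send in senders:
        try:
            send(msg)
            return name, failures
        except (OSError, smtplib.SMTPException) as exc:
            failures.append((name, str(exc)))
    return None, failures
```

For example, send_with_fallback(msg, [("primary",
smtp_sender("smtp1.example.org")), ("fallback",
smtp_sender("smtp2.example.org"))]) run while the primary is blocked
should report "fallback" as the path that worked. Both host names are
placeholders.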

Is nagios 3 in stg the result of an in-place upgrade from nagios 2?
Should the essentials of the upgrade procedure be documented in order to
be replayed when the time comes in prod?

_______________________________________________
infrastructure mailing list
infrastructure@xxxxxxxxxxxxxxxxxxxxxxx
https://admin.fedoraproject.org/mailman/listinfo/infrastructure