On Fri, Mar 4, 2011 at 18:31, Gareth Marchant <gareth@xxxxxxxxxxxx> wrote:
> Kevin Fenzi <kevin@xxxxxxxxx> wrote:
>>
>> On Fri, 04 Mar 2011 19:07:53 -0500 Gareth Marchant <gareth@xxxxxxxxxxxx>
>> wrote:
>>
>> > Does the nagios stage environment operate in an equivalent manner to
>> > prod such that testing nagios 3 in stage for these systems would
>> > accurately reflect prod? I assume that there are specific monitors for
>> > each of these systems that would need to be exercised? I can only
>> > imagine what that list will look like...
>>
>> https://admin.stg.fedoraproject.org/nagios/
>>
>> You can see that it can't reach/monitor a lot of the things that the real
>> instance does. The stg env just doesn't have access to all the things it
>> would need outside it.
>>
>> kevin
>> ________________________________
>> infrastructure mailing list
>> infrastructure@xxxxxxxxxxxxxxxxxxxxxxx
>> https://admin.fedoraproject.org/mailman/listinfo/infrastructure
>
> How about devices? I am sure there are routers, switches, gateways,
> firewalls and maybe storage hardware monitored by nagios that are high
> priority/highly critical and worthy of test?

We don't control 99.999% of them and have no access to them beyond
pinging them. In many ways our infrastructure is very much a "cloud". We
have systems but everything else is outsourced :). The storage hardware
we can monitor is pretty much the EqualLogics that releng has.
Everything else we get through closed, firewalled-off networks.

> How deeply should testing go or, put another way, how much go-live risk can
> be tolerated? Should a gap analysis of stage environment to production be
> performed prior to making a nagios test plan? I am not sure how rigorously
> structured this upgrade plan should be!
>

If gap analysis or other items are itches you like to scratch, we can
work them into version 2 of the test plan(s).
It would be a good training exercise for people to see how it's done (as
I only know it from consultants who were not doing it right according to
the next set of consultants). If they are not things you like to touch
with a 10 foot pole, I have no wish to make a volunteer spend time on
them.

Our go-live risk tolerance is pretty high, as we have done upgrades with
no test plan for 6-7 years now. The goal here is to start from something
a bit more complex than "does the web page have errors? No? Then we are
good," because we have grown to be more complex and end up with 4-8 hour
periods of "well darn, I completely forgot that." So I expect that we
will have many lessons learned after each upgrade to say "we will add
this to testing next time," and then be able to do so.

I guess what I am saying is: let's do enough that it fits on an iPad
web page the first time and make it more complex as we go.

My general philosophy for people volunteering time on Fedora is:

Rule 1: Do good work for others as you would want them to do for you.
Rule 2: Have Fun
Rule 3: Keep true to Freedom, Friends, First, and Features without
        breaking 1 or 2.

So don't stress over the test plan if it misses a bunch of stuff. [I am
saying this out loud because I usually get stressed over such stuff and
have to remind myself :).] My main hope is to learn how to do our stuff
better incrementally.

I hope this helps better outline what we need to start with. If a
deadline would work better, I would like to have Nagios be ready to go
live by the first of April. What do we need to have noc01.stg tested by
March 28th?

-- 
Stephen J Smoogen.
"The core skill of innovators is error recovery, not failure avoidance."
Randy Nelson, President of Pixar University.
"Let us be kind, one to another, for most of us are fighting a hard
battle." -- Ian MacLaren