On Fri, Mar 4, 2011 at 18:31, Gareth Marchant gareth@litehaus.net wrote:
Kevin Fenzi kevin@scrye.com wrote:
On Fri, 04 Mar 2011 19:07:53 -0500 Gareth Marchant gareth@litehaus.net wrote:
> Does the nagios stage environment operate in an equivalent manner to
> prod such that testing nagios 3 in stage for these systems would
> accurately reflect prod? I assume that there are specific monitors
> for each of these systems that would need to be exercised? I can only
> imagine what that list will look like...

https://admin.stg.fedoraproject.org/nagios/

You can see that it can't reach/monitor a lot of the things that the real instance does. The stg env just doesn't have access to all the things it would need outside it.

kevin
How about devices? I am sure there are routers, switches, gateways, firewalls and maybe storage hardware monitored by nagios that are high priority/highly critical and worth testing?
We don't control 99.999% of them and have no access to them beyond pinging them. In many ways our infrastructure is very much a "cloud". We have systems but everything else is outsourced :).
The storage hardware we can monitor is pretty much the EqualLogics that releng has. Everything else we get through closed, firewalled-off networks.
How deeply should testing go or, put another way, how much go-live risk can be tolerated? Should a gap analysis of stage environment to production be performed prior to making a nagios test plan? I am not sure how rigorously structured this upgrade plan should be!
If gap analysis or other items are itches you like to scratch, we can work them into version 2 of the test plan(s). It would be a good training exercise for people to see how it's done (as I only know it from consultants who were not doing it right according to the next set of consultants). If they are not things you would touch with a 10-foot pole, I have no wish to make a volunteer spend time on them.
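As a rough illustration of what a first-pass gap analysis could look like, here is a minimal sketch that compares the checks prod runs against the checks stage can run. The host and check names are hypothetical placeholders, and the lists are typed in by hand; nothing here reads from an actual Nagios instance.

```python
def gap_analysis(prod_checks, stage_checks):
    """Split (host, check) pairs into prod-only, stage-only, and shared."""
    prod, stage = set(prod_checks), set(stage_checks)
    return {
        "prod_only": sorted(prod - stage),   # cannot be exercised in stage
        "stage_only": sorted(stage - prod),  # exists only in stage
        "both": sorted(prod & stage),        # can be tested before go-live
    }


# Hypothetical example data, not the real Fedora check lists.
prod = [("host-a", "ping"), ("router-a", "ping"), ("storage-a", "disk")]
stage = [("host-a", "ping")]

report = gap_analysis(prod, stage)
for name, checks in report.items():
    print(name, checks)
```

Everything under "prod_only" is a check that only gets its first real test at go-live, so that list is the part worth reading before the upgrade.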
Our go-live risk tolerance is pretty high, as we have done upgrades with no test plan for 6-7 years now. The goal here is to start from something a bit more complex than "does the web page have errors? No? Then we are good," because we have grown more complex and end up with 4-8 hour periods of "well darn, I completely forgot that."
So I expect that we will have many lessons learned after each upgrade, where we say "we will add this to testing next time" and are then able to do so. I guess what I am saying is: let's do enough that it fits on an iPad web page the first time, and make it more complex as we go.
My general philosophy for people volunteering time on Fedora is:
Rule 1: Do good work for others as you would want them to do for you.
Rule 2: Have Fun
Rule 3: Keep true to Freedom, Friends, First, and Features without breaking 1 or 2.
So don't stress over the test plan if it misses a bunch of stuff. [I am saying this out loud because I usually get stressed over such stuff and have to remind myself :).] My main hope is to learn how to do our stuff better incrementally.
I hope this helps better outline what we need to start with. If a deadline would work better, I would like to have Nagios be ready to go live by the first of April. What do we need to have noc01.stg tested by March 28th?