Re: Top 10 services/servers/etc

5 Mar 2011


On Fri, Mar 4, 2011 at 18:31, Gareth Marchant gareth@litehaus.net wrote:
...
Kevin Fenzi kevin@scrye.com wrote:
...
On Fri, 04 Mar 2011 19:07:53 -0500 Gareth Marchant gareth@litehaus.net wrote:
> Does the nagios stage environment operate in an equivalent manner to
> prod such that testing nagios 3 in stage for these systems would
> accurately reflect prod? I assume that there are specific monitors
> for each of these systems that would need to be exercised? I can only
> imagine what that list will look like...

https://admin.stg.fedoraproject.org/nagios/

You can see that it can't reach/monitor a lot of the things that the real
instance does. The stg env just doesn't have access to all the things it
would need outside it.

kevin
________________________________
infrastructure mailing list infrastructure@lists.fedoraproject.org
https://admin.fedoraproject.org/mailman/listinfo/infrastructure
How about devices? I am sure there are routers, switches, gateways,
firewalls and maybe storage hardware monitored by nagios that are high
priority/highly critical and worthy of test?
We don't control 99.999% of them and have no access to them beyond
pinging them. In many ways our infrastructure is very much a "cloud".
We have systems but everything else is outsourced :).
The storage hardware we can monitor is pretty much the EqualLogics that
releng has. Everything else we get through closed, firewalled-off
networks.
...
How deeply should testing go or, put another way, how much go-live risk can
be tolerated? Should a gap analysis of stage environment to production be
performed prior to making a nagios test plan? I am not sure how rigorously
structured this upgrade plan should be!
If gap analysis or other items are itches you like to scratch, we can
work them into version 2 of the test plan(s). It would be a good
training exercise for people to see how it's done (as I only know it
from consultants who were not doing it right according to the next set
of consultants). If they are things you would not touch with a 10
foot pole, I have no wish to make a volunteer spend time on them.
Our go-live risk tolerance is pretty high, as we have done upgrades
with no test plan for 6-7 years now. The goal here is to start from
something a bit more complex than "does the web page have errors? No?
Then we are good," because we have grown more complex and end up
with 4-8 hour periods of "well darn, I completely forgot that."
So I expect that we will have many lessons learned after each upgrade,
where we say "we will add this to testing next time" and are then able
to do so. I guess what I am saying is: let's do enough that it fits on
an iPad web page the first time and make it more complex as we go.
My general philosophy for people volunteering time on Fedora is:
Rule 1: Do good work for others as you would want them to do for you.
Rule 2: Have Fun
Rule 3: Keep true to Freedom, Friends, First, and Features without
breaking 1 or 2.
So don't stress over the test plan if it misses a bunch of stuff. [I
am saying this out loud because I usually get stressed over such stuff
and have to remind myself :).] My main hope is to learn how to do our
stuff better incrementally.
I hope this helps better outline what we need to start with. If a
deadline would work better, I would like to have Nagios ready to go
live by the first of April. What do we need in order to have noc01.stg
tested by March 28th?
-- 
Stephen J Smoogen.
"The core skill of innovators is error recovery, not failure avoidance."
Randy Nelson, President of Pixar University.
"Let us be kind, one to another, for most of us are fighting a hard
battle." -- Ian MacLaren
