Top 10 services/servers/etc

Sat Mar 5 23:06:18 UTC 2011

On Fri, Mar 4, 2011 at 18:31, Gareth Marchant <gareth at litehaus.net> wrote:
> Kevin Fenzi <kevin at scrye.com> wrote:
>>
>> On Fri, 04 Mar 2011 19:07:53 -0500
>> Gareth Marchant <gareth at litehaus.net> wrote:
>>
>> > Does the nagios stage environment operate in an equivalent manner to
>> > prod such that testing nagios 3 in stage for these systems would
>> > accurately reflect prod? I assume that there are specific monitors
>> > for each of these systems that would need to be exercised? I can only
>> > imagine what that list will look like...
>>
>> https://admin.stg.fedoraproject.org/nagios/
>>
>> You can see that it can't reach/monitor a lot of the things that the real
>> instance does. The stg env just doesn't have access to all the things it
>> would need outside it.
>>
>> kevin
>
> How about devices? I am sure there are routers, switches, gateways,
> firewalls and maybe storage hardware monitored by nagios that are high
> priority/highly critical and worthy of test?

We don't control 99.999% of them and have no access to them beyond
pinging them. In many ways our infrastructure is very much a "cloud":
we have systems, but everything else is outsourced :).

The storage hardware we can monitor is pretty much just the EqualLogics
that releng has. Everything else we get through closed, firewalled-off
networks.

> How deeply should testing go or, put another way, how much go-live risk can
> be tolerated? Should a gap analysis of stage environment to production be
> performed prior to making a nagios test plan? I am not sure how rigorously
> structured this upgrade plan should be!
>

If gap analysis or other items are itches you like to scratch, we can
work them into version 2 of the test plan(s). It would be a good
training exercise for people to see how it's done (as I only know it
from consultants who were not doing it right, according to the next set
of consultants). If they are not things you would touch with a 10-foot
pole, I have no wish to make a volunteer spend time on them.

Our go-live risk tolerance is pretty high, as we have done upgrades
with no test plan for 6-7 years now. The goal here is to start from
something a bit more complex than "does the web page have errors? No?
Then we are good," because we have grown to be more complex and end up
with 4-8 hour periods of "well darn, I completely forgot that."

So I expect that we will have many lessons learned after each upgrade,
where we say "we will add this to testing next time" and are then able
to do so. I guess what I am saying is: let's do enough that it fits on
an iPad web page the first time, and make it more complex as we go.
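
As a rough illustration of "start small and add one item per lesson
learned", a first-pass checklist could be as small as the sketch below.
Everything in it is hypothetical and not something we run today: the
Check class, the helper names, and the error markers are made up for the
example, and real checks against noc01.stg would replace the
placeholders.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

# Hypothetical markers that suggest a page rendered an error.
ERROR_MARKERS = ("internal server error", "traceback", "503 service")


@dataclass
class Check:
    """One line of the test plan: a name and a callable returning True on pass."""
    name: str
    run: Callable[[], bool]


def page_looks_healthy(body: str) -> bool:
    """The old 'does the web page have errors' test, applied to a page body."""
    lowered = body.lower()
    return not any(marker in lowered for marker in ERROR_MARKERS)


def run_checklist(checks: List[Check]) -> List[Tuple[str, bool]]:
    """Run every check; a check that raises counts as a failure, not a crash."""
    results = []
    for check in checks:
        try:
            ok = bool(check.run())
        except Exception:
            ok = False
        results.append((check.name, ok))
    return results


def format_report(results: List[Tuple[str, bool]]) -> str:
    """Render a report short enough to fit on one small page."""
    lines = [f"[{'PASS' if ok else 'FAIL'}] {name}" for name, ok in results]
    passed = sum(1 for _, ok in results if ok)
    lines.append(f"{passed}/{len(results)} checks passed")
    return "\n".join(lines)
```

Each "well darn, I completely forgot that" moment after an upgrade then
becomes one more Check appended to the list for next time.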

My general philosophy for people volunteering time on Fedora is:
Rule 1: Do good work for others as you would want them to do for you.
Rule 2: Have Fun
Rule 3: Keep true to Freedom, Friends, First, and Features without
breaking 1 or 2.

So don't stress over the test plan if it misses a bunch of stuff. [I
am saying this out loud because I usually get stressed over such stuff
and have to remind myself :).] My main hope is to learn how to do our
stuff better incrementally.

I hope this helps better outline what we need to start with. If a
deadline would work better: I would like to have Nagios ready to go
live by the first of April. What do we need in order to have noc01.stg
tested by March 28th?

-- 
Stephen J Smoogen.
"The core skill of innovators is error recovery, not failure avoidance."
Randy Nelson, President of Pixar University.
"Let us be kind, one to another, for most of us are fighting a hard
battle." -- Ian MacLaren