On 5 October 2010 16:44, Mike McGrath <mmcgrath(a)redhat.com> wrote:
In an effort to further hide the fas issues we've been running into, I've
added an event handler to the app servers. A brief description of the
problem: when fas hangs, app server httpd processes stack up. When they
do, they become unresponsive.
Currently nagios does this on failure:
Failed check 1: nothing (Soft)
Failed check 2: nothing (Soft)
Failed check 3: Send notification (hard)
Once it hits that hard state, nagios claims it's dead. We get paged, the
alert shows up in #fedora-noc. Doom.
Now what it does is this:
Failed check 1: nothing (Soft)
Failed check 2: send notification to #fedora-noc, issue a service httpd reload
Failed check 3: send paged / emailed notifications, issue a service httpd restart
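
For readers who have not written one, the escalation above can be sketched as
a small event handler script. This is only a sketch under assumptions, not the
handler actually deployed on the app servers: it assumes the service is
configured with 3 check attempts, that nagios passes $SERVICESTATE$,
$SERVICESTATETYPE$ and $SERVICEATTEMPT$ as the three arguments, and that
/sbin/service is the right way to poke httpd. The function names are made up
for illustration. The #fedora-noc and pager notifications are left to nagios
itself and are not shown.

```python
import subprocess
import sys


def choose_action(state, state_type, attempt):
    """Pick the httpd action for one check result (hypothetical helper).

    Assumes a service with 3 check attempts: the 2nd failure is still
    SOFT, the 3rd failure is the one that turns the state HARD.
    """
    if state != "CRITICAL":
        return None          # OK / WARNING / UNKNOWN: leave httpd alone
    if state_type == "SOFT" and attempt == 2:
        return "reload"      # second failed check: gentle fix
    if state_type == "HARD":
        return "restart"     # third failed check: full restart
    return None              # first failed check: do nothing


def main(argv):
    # argv[1:4] are expected to be $SERVICESTATE$ $SERVICESTATETYPE$
    # $SERVICEATTEMPT$ (an assumption about the command definition).
    state, state_type, attempt = argv[1], argv[2], int(argv[3])
    action = choose_action(state, state_type, attempt)
    if action is not None:
        # Path to the service command is an assumption.
        subprocess.call(["/sbin/service", "httpd", action])
    return 0


# Only act when called with exactly the three expected arguments.
if __name__ == "__main__" and len(sys.argv) == 4:
    sys.exit(main(sys.argv))
```

Keeping the decision in its own function makes it easy to try a different
schedule (for example, reload on a later check) without touching the part
that actually runs the service command.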
This is a big change from how things were, and as such we should track it
closely. The reason for sending the notification to #fedora-noc is to
ensure things aren't auto-correcting without us knowing. But at the
same time we're not generating a lot of unneeded email / paged alerts.
I'm going to let this run for a while, and let's see how it goes.
pkgdb, for whatever reason, has always been an excellent canary, which is
why I'm checking it.
Hi:
It looks OK to me, but do you have stats on how many times you get a
2nd failed check without reaching a 3rd? I'm thinking of network
micro-outages, load peaks, or something funny on the server. Maybe it
needs to be a 4-check service (reload at the third).
On the other hand, it's just a reload of httpd ;)
Regards
--
"My name is Ozymandias, king of kings:
Look on my works, ye Mighty, and despair!"
Percy Bysshe Shelley
http://sites.google.com/site/carlossepulveda