Nagios event handlers

Wed Oct 6 15:35:12 UTC 2010

On Wed, 6 Oct 2010, Carlos (casep) Sepulveda wrote:

> On 5 October 2010 16:44, Mike McGrath <mmcgrath at redhat.com> wrote:
> > In an effort to further hide the fas issues we've been running into, I've
> > added an event handler to the app servers.  In brief, the problem is that
> > when fas hangs, app server httpd processes stack up.  When they do, they
> > become unresponsive.
> >
> > Currently nagios does this on failure:
> >
> > Failed check 1: nothing (soft)
> > Failed check 2: nothing (soft)
> > Failed check 3: send notification (hard)
> >
> > Once it hits that hard state, nagios claims it's dead.  We get paged, the
> > alert shows up in #fedora-noc.  Doom.
> >
> > Now what it does is this:
> >
> > Failed check 1: nothing (soft)
> > Failed check 2: send notification to #fedora-noc, issue a service httpd
> >      reload
> > Failed check 3: send paged / emailed notifications, issue a service httpd
> >      restart
> >
> >
> > This is a significant change from how things were, and as such we should
> > track it closely.  The reason for sending the notification to #fedora-noc
> > is to ensure things aren't auto-correcting without us knowing, while at
> > the same time not generating a lot of unneeded email / paged alerts.
> > I'm going to let this run for a while and let's see how it goes.
> >
> > pkgdb, for whatever reason, has always been an excellent canary, which is
> > why I'm checking it.
> >
>
>
> Hi:
> It looks OK to me, but do you have stats on how many times you get a
> 2nd failed check without reaching a 3rd? I'm thinking of network
> micro-outages, load peaks or something funny in the server. Maybe it
> needs to be a 4-check service (reload at the third).
> On the other hand, it's just a reload of httpd ;)
>

I had considered doing a graceful restart, but my feeling is it wouldn't
recover quickly enough for this change to have its full effect :-/

       -Mike
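
For anyone following along, here is a rough sketch of the escalation logic
described in the quoted message, written as a small Python helper. This is
not the handler actually deployed on the app servers; it only illustrates the
decision table. It assumes the handler is passed the standard Nagios macros
$SERVICESTATE$, $SERVICESTATETYPE$ and $SERVICEATTEMPT$ as arguments. The
function name, the action labels and the reload_at parameter are made up for
illustration.

```python
import sys


def choose_actions(state, state_type, attempt, reload_at=2):
    """Pick the actions for one event handler call (illustrative only).

    state      -- value of $SERVICESTATE$, e.g. "OK" or "CRITICAL"
    state_type -- value of $SERVICESTATETYPE$, "SOFT" or "HARD"
    attempt    -- value of $SERVICEATTEMPT$ as an integer
    reload_at  -- hypothetical knob: the soft attempt that triggers a reload
    """
    if state != "CRITICAL":
        # Recoveries and other states need no corrective action here.
        return []
    if state_type == "HARD":
        # Final failed check: page / email and do a full restart.
        return ["page", "restart"]
    if state_type == "SOFT" and attempt == reload_at:
        # Intermediate failed check: tell the IRC channel, then reload.
        return ["notify_irc", "reload"]
    # Earlier soft failures: do nothing and wait for the next check.
    return []


if __name__ == "__main__":
    # Dry run: print what would be done instead of touching httpd.
    state, state_type, attempt = sys.argv[1], sys.argv[2], int(sys.argv[3])
    for action in choose_actions(state, state_type, attempt):
        print(action)
```

With the defaults this matches the three-check scheme above: nothing on the
first failure, a reload on the second, a restart once the state goes hard.
Carlos's four-check variant would correspond to raising the service's maximum
check attempts to 4 and calling the helper with reload_at=3.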