How much downtime do we afford for nagios?

Nigel Jones dev at nigelj.com
Sun Apr 27 06:09:21 UTC 2008


> Hi,
Hi,
>
> For a few days the false notifications from nagios went down, but they
> have increased again.
You sure?
>
> Looking at /configs/system/nagios/services/template.cfg reveals
> that it is configured with
> max_check_attempts = 4 and retry_check_interval = 1 for hosts
> and
> max_check_attempts = 3 and retry_check_interval = 1 for services.
>
> So if a service or host is unreachable for 3 or 4 mins, we get a
> notification. (However, in most cases it is a false positive, caused by
> congestion or something similar.)
Looking through my email, as far as I can tell there were no false
positives.  xen6 had to be power-cycled, which caused all the other
collateral notifications.

Just to put it into perspective...
1st notification: 0212 UTC - Accounts down on .120-phx
...
5th notification: 0216 UTC - UNKNOWN status on xen6 (NRPE: Unable to read output)
...
11th/12th notifications: 0228 UTC - Host Down - xen6/db2
Starting 0233 UTC - Host/service UP/OK notifications

According to my IRC logs xen6 went a bit haywire and had to be rebooted,
so TBH I don't see what is false here.


Yes, congestion can cause some problems, but isn't that also a sign that
stuff may need to be balanced better or given more processing/networking
capacity?

The current delay is long enough that not every single VPN blip triggers a
notification, but short enough to still give an early idea of real problems.
>
> How about working out a delay which we can afford if a service or
> host is really down? How about 10 mins (5 attempts x 2 mins)?
IMO this is too long. Also, it doesn't take that long for someone to SSH
in and have a quick look. I don't speak for everyone, but I don't mind
spending 2-5 minutes to check.
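
For reference, a rough sketch of the arithmetic behind these numbers (Python,
purely illustrative; the function name and labels are made up here). It assumes
the first failed check counts as attempt 1 and that each retry follows one
retry_check_interval later, with the interval measured in minutes. The
"attempts x interval" figure used in this thread is then an upper estimate, and
the time before the first failure is even detected is not included.

```python
def minutes_to_notification(max_check_attempts, retry_check_interval):
    """Return (after_first_failure, thread_estimate) in minutes.

    after_first_failure: assumes the first failed check is attempt 1, so
        only (max_check_attempts - 1) retries remain before notification.
    thread_estimate: the simpler "attempts x interval" figure from the thread.
    """
    after_first_failure = (max_check_attempts - 1) * retry_check_interval
    thread_estimate = max_check_attempts * retry_check_interval
    return after_first_failure, thread_estimate


# (max_check_attempts, retry_check_interval) as quoted in this thread
settings = {
    "hosts (current)": (4, 1),
    "services (current)": (3, 1),
    "proposed": (5, 2),
}

for name, (attempts, interval) in settings.items():
    low, high = minutes_to_notification(attempts, interval)
    print(f"{name}: {low} to {high} minutes")
```

Under those assumptions the current settings notify after roughly 2 to 4
minutes, while the proposed 5 x 2 would take 8 to 10.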
>
> Also, we could list which services/hosts are critical and which are not.
> That would help to define different notification periods for the
> different hosts/services.
>
> I thought I would do it after the freeze, but it's becoming too annoying.
Personally, I don't think anything should be done at the moment.

- Nigel



