How much downtime do we afford for nagios?

Nigel Jones dev at nigelj.com
Sun Apr 27 07:00:44 UTC 2008


>>  > So if a service or host is unreachable for 3 or 4 mins, we get a
>>  > notification. (However, in most cases it is a false positive, due to
>>  > congestion or other causes.)
>>  Looking through my email, from what I can recall there are no false
>>  positives.  xen6 had to be power-cycled which caused all the other
>>  collateral notifications.
>
>
> How long was it down?  Why should a normal reboot send 23 mails?
> A reboot is not an exceptional thing, is it?
> An alert should be sent only when it's absolutely necessary...
> It should report only when xen6 comes up but a service does not come up.
> What do you think?
> Thanks.
Remembering that unresponsive and down are different things, it looks like
it went unresponsive at ~0210 UTC (2-3 minutes before the first email). I
*think* this might have just been the domUs at that point; from the IRC logs
it looks like the dom0 was rebooted sometime around 0228 (potentially
beforehand, I do not know).

It's 1 email per checked item for down/up, and I guess, in perspective, it
was quite big...

IMO these reports are 'absolutely necessary', and I personally like to
check them every now and then, especially after an outage like this, to see
if everything is back up (the service/host overview on the nagios web
interface is handy for this).

- Nigel
>
>
>
> --
> Regards,
> Susmit.
>
> =============================================
> ssh
> 0x86DD170A
> http://www.fedoraproject.org/wiki/SusmitShannigrahi
> =============================================