How much downtime do we afford for nagios?
dev at nigelj.com
Sun Apr 27 07:00:44 UTC 2008
>> > So if a service or host is unreachable for 3 or 4 mins, we get a
>> > notification. (However, in most cases it is a false positive, due to
>> > congestion or other causes.)
>> Looking through my email, from what I can recall there are no false
>> positives. xen6 had to be power-cycled which caused all the other
>> collateral notifications.
> How long was it down? Why should a normal reboot send 23 mails?
> A reboot is not anything exceptional, is it?
> An alert should be sent only when it's absolutely necessary...
> it should report only when xen6 comes up but a service does not come up..
> What do you think?
Remembering that unresponsive and down are different things, it looks like
it went unresponsive ~0210 UTC (2-3 minutes before the first email) - I
*think* this might have just been the domUs at that point; from the IRC logs
it looks like the dom0 was rebooted sometime around 0228 (potentially
beforehand, I do not know).
It's 1 email per checked item for down/up, so I guess, in perspective, the
outage was quite big...
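As a rough illustration of how that adds up (a toy model only, not how
nagios itself is implemented; the guest and service names below are made
up and are not our real check list):

```python
# Toy model: one notification mail per checked item per state change.
# All item names here are hypothetical placeholders.

def count_mails(checked_items, transitions=("DOWN", "UP")):
    """Return one (item, state) mail for every item and every transition."""
    return [(item, state) for state in transitions for item in checked_items]

# A dom0 plus a few guests, each with a host check and two service checks.
items = ["xen6 (host)"]
for guest in ("guest1", "guest2", "guest3"):
    items += [f"{guest} (host)", f"{guest}/ssh", f"{guest}/http"]

mails = count_mails(items)
print(len(items), "checked items ->", len(mails), "mails")
```

So even a handful of guests on one dom0 gets into double digits once every
host and service check reports both the failure and the recovery.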
IMO these reports are 'absolutely necessary' and I personally like to
check it every now and then (especially after an outage like this, to see
if everything is back up; the service/host overview on the nagios web
interface is handy).
> Fedora-infrastructure-list mailing list
> Fedora-infrastructure-list at redhat.com