monitoring rebuild

R P Herrold herrold at owlriver.com
Thu Feb 7 22:40:30 UTC 2013


On Thu, 7 Feb 2013, Kevin Fenzi wrote:

> I'd like to look at revamping our nagios setup to be nicer.
>
> Problems with our current setup:
>
> - It's very easy to make syntax errors that you can only fix with
>  further commits. (ie, no sanity checking at all).

Hard to prevent fat fingers -- some tools exist to emit
well-formed files, and there is a 'preflighting' capability
in Nagios ('nagios -v' over the config) that might be run
before accepting a VCS commit
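
A minimal sketch of such a preflight as a git pre-commit
hook, assuming the Nagios config lives in the committed tree
(the path and hook placement are illustrative only):

    #!/bin/sh
    # pre-commit hook sketch: run Nagios' own verification pass
    # over the working-tree copy of the config and refuse the
    # commit if it fails.  Path is an assumption; point it at
    # the real checkout layout.
    nagios -v etc/nagios.cfg || {
        echo "nagios -v failed; fix the config before committing" >&2
        exit 1
    }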

> - nagios alerts us too much for things that don't matter or that we
>  cannot do anything about. (isolated network issues, servers not
>  responding in cases where there are no impacts, alerting then
>  recovering very quickly)

That sounds like simple tuning of the alert thresholds, and
may be done iteratively as a given false positive becomes
overly annoying
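
As an illustration only (the hostname and numbers are
invented), the usual knobs are check retries, the
notification interval, and flap detection on the noisy
service definitions:

    define service {
        use                     generic-service
        host_name               proxy01.example.org  ; hypothetical host
        service_description     HTTP
        max_check_attempts      5    ; several SOFT failures before alerting
        retry_interval          2    ; minutes between re-checks while SOFT
        notification_interval   120  ; do not re-page every few minutes
        flap_detection_enabled  1    ; damp the alert/recover churn
    }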

> - our dependencies are not right, so we get 50 pages for an issue that
>  is a single network or gateway link being down.

That is solved within the four corners of Nagios, which
permits maintaining 'maps' of what is 'behind' a given node.
Once configured, Nagios does not look, nor complain, further
down an impaired path
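
Concretely, that is the 'parents' directive on host
definitions -- a sketch with invented names:

    define host {
        use        generic-host
        host_name  backend01.example.org   ; hypothetical host
        address    192.0.2.10
        parents    gw01.example.org        ; the gateway it sits behind
    }
    ; with gw01.example.org DOWN, backend01 becomes UNREACHABLE
    ; rather than DOWN, and UNREACHABLE notifications can simply
    ; be turned off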

> - nagios monitors remote machines over vpn links, but also over non vpn
>  links, leading to us having confusion over what a machine is named in
>  nagios.

Using fully qualified hostnames, and a 'dummy' TLD of '.vpn'
for the tunnel-side names, comes to mind
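
A sketch of how that might look -- one physical box, two
host entries, the tunnel side under the dummy TLD (names and
addresses are invented):

    define host {
        use        generic-host
        host_name  db01.example.org   ; reached over the public path
        address    203.0.113.20
    }

    define host {
        use        generic-host
        host_name  db01.vpn           ; same machine, over the tunnel
        address    172.16.5.20
    }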

> - Our current setup isn't flexible enough to allow monitoring of non
>  core resources, but we want to do this. (for example, qa machines
>  or secondary arch machines or cloud machines, where we may want
>  different groups to get alerted or manage those machines).

Nagios has the concept of notification sub-groups (contact
groups) for clutches of machines
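
Roughly (group and member names are placeholders), a contact
group per constituency, with their machines pointed at it:

    define contactgroup {
        contactgroup_name  qa-admins            ; placeholder group
        alias              QA machine admins
        members            qauser1,qauser2
    }

    define host {
        use             generic-host
        host_name       qa01.example.org        ; placeholder host
        address         198.51.100.30
        contact_groups  qa-admins               ; only this group is paged
    }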

> - When we add new machines or services we often forget to add
>  monitoring, or we don't monitor all the things we should be.

That is more a matter of a gap in the accession and
decommission checklists, no?

> - When we do work on machines we sometimes forget to silence nagios.

<hal voice> 'This kind of thing has happened before, and it
has always been ... human error' </hal voice>
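
For what it is worth, the silencing can be folded into the
maintenance script itself by writing to the external command
file -- a sketch, with an assumed command-file path and an
invented hostname:

    # schedule two hours of fixed host downtime before starting work
    now=$(date +%s)
    printf '[%s] SCHEDULE_HOST_DOWNTIME;web01.example.org;%s;%s;1;0;7200;root;planned work\n' \
        "$now" "$now" "$((now + 7200))" >> /var/spool/nagios/cmd/nagios.cmd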

... snip here, mention of yet more technological tools ...

Nagios may or may not be the right solution, but it is not
very heavy on load and works better than at least 80 per
cent of the alternatives I've tried, I'd say -- we run
(locally and externally) Zabbix, Bugzilla, smokeping, cacti,
OpenNMS, a local wiki from tarballs, a local MediaWiki,
local custom SNMP trapping, and custom DB-backed outage
tracking and remediation -- and each has faults or things I
wish they did differently.  At least Nagios is reasonably
extensible

I have a couple of bugs open on some of those tools in EPEL,
and they have not resulted in any apparent change, so it may
simply be that the talent is spread too thin, and I do not
have a solution for that

Most of what you outlined as problems is either a usage or
training issue, or a systematic sysadmin issue -- ladling on
puppet, or cfengine, or ansible, or (name the new and shiny
tool of the day) -- a technology solution -- won't solve
people issues (not reading, not testing, forcing commits of
broken setups, and leaving room for design and documentation
improvements)

-- Russ herrold

