Greetings.
I'd like to look at revamping our nagios setup to be nicer.
Problems with our current setup:
- It's very easy to make syntax errors that you can only fix with further commits. (ie, no sanity checking at all).
- nagios alerts us too much for things that don't matter or that we cannot do anything about. (isolated network issues, servers not responding in cases where there are no impacts, alerting then recovering very quickly)
- our dependencies are not right, so we get 50 pages for an issue that is a single network or gateway link being down.
- nagios monitors remote machines over vpn links, but also over non vpn links, leading to us having confusion over what a machine is named in nagios.
- Our current setup isn't flexible enough to allow monitoring of non-core resources, but we want to do this. (for example, qa machines or secondary arch machines or cloud machines, where we may want different groups to get alerted or manage those machines).
- When we add new machines or services we often forget to add monitoring, or we don't monitor all the things we should be.
- When we do work on machines we sometimes forget to silence nagios.
Requirements:
- syntax check commits and reject them where they would break things.
- only send pages for things that impact either end users or maintainers. Things that don't impact either of those should mail/notify, but not page.
- correct dependencies so if one thing is down all that block on it really does not alert.
- allow non FI groups to have machines/networks they manage.
- When adding a new machine/service, monitoring should be automatically configured if at all possible.
- Possibly have some kind of escalations setup... page, wait X min for an ack, if not page again or page more people, etc.
- When doing work, have an easy way to silence alerts for affected machines.
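On the escalation point, nagios already has escalation objects that do the page/wait/re-page dance; a rough sketch of the idea (host and group names are made up, not our real config):

```
# Hypothetical sketch: leave the first couple of notifications to the
# primary group, then start paging the wider group every 15 minutes
# until the problem is acked or recovers.
define hostescalation {
    host_name               db01                 ; hypothetical host
    first_notification      3                    ; kicks in at the 3rd notification
    last_notification       0                    ; 0 = stays in effect until recovery
    notification_interval   15
    contact_groups          sysadmin-main,sysadmin-pager
}
```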
Some possible parts of the solution:
- check_mk - This would allow us to just install and query it, and it would monitor all the running process, etc. Could very much simplify config on normal nodes.
- ansible playbooks to easily disable notifications or alerts where needed. We can use the web interface, but a simple script to do this would be nice to call from other playbooks. This needs consistent naming for hosts.
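For the silencing piece, nagios's external command file already gives us a scriptable hook: one SCHEDULE_HOST_DOWNTIME line silences a host for a window. A minimal sketch (the host name, command-file path, and 2-hour window are assumptions for illustration; the field order is standard nagios external-command syntax):

```shell
# Build a nagios SCHEDULE_HOST_DOWNTIME external command line.
# Fields after the host: start;end;fixed;trigger_id;duration;author;comment
host="db01.phx2.fedoraproject.org"            # hypothetical host
cmdfile="/var/spool/nagios/cmd/nagios.cmd"    # typical command-file path

now=$(date +%s)
duration=7200                                 # silence for 2 hours
end=$((now + duration))

cmd="[$now] SCHEDULE_HOST_DOWNTIME;$host;$now;$end;1;0;$duration;ansible;planned work"
echo "$cmd"

# A playbook task would append it to the live command file, e.g.:
#   echo "$cmd" >> "$cmdfile"
```

A playbook could loop this over the affected hosts before starting work, which is why consistent host naming matters.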
- commit hook to check syntax. This will require nagios on lockbox01 and installing to a tree or something, unless there's a way we can pull the syntax checking out of nagios. Does python-nagios package do it?
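A sketch of what the hook body could look like, assuming `nagios -v` is available on the box doing the check (the config path is a placeholder for wherever the hook exports the pushed tree):

```shell
# Preflight a nagios configuration before accepting a commit.
# "nagios -v <cfg>" parses the whole config and exits non-zero on
# errors, which is what we want to gate pushes on.
preflight() {
    cfg=$1
    if command -v nagios >/dev/null 2>&1 && [ -f "$cfg" ]; then
        nagios -v "$cfg"
    else
        # Not on the nagios box (or no checkout yet): don't block.
        echo "skipping preflight of $cfg" >&2
        return 0
    fi
}

# In a git update hook this would run against an exported tree, e.g.:
#   git archive "$newrev" | tar -x -C "$tmpdir"
#   preflight "$tmpdir/nagios.cfg" || exit 1
preflight "nagios.cfg"
rc=$?
echo "preflight exit status: $rc"
```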
- Identify those items/hosts that are ones that should page, assume rest are email/notice only. The mass reboot sop has these already in the Class A/B hosts listed there. I can easily generate a list.
I'm sure there are more things we can do here, and several people have looked into this problem in the past. It's not a simple one, but I think it's a good one to work on and fix.
I'm not sure if it makes sense to make a new nagios git repo (like we did for dns) or just move it over to ansible repo. I kind of think it would be nicer just in ansible repo.
This is just prelim ideas on the scope and problem... I'll look at starting on some more detailed work soon.
kevin
On Thu, 7 Feb 2013, Kevin Fenzi wrote:
> I'd like to look at revamping our nagios setup to be nicer.
> Problems with our current setup:
> - It's very easy to make syntax errors that you can only fix with
> further commits. (ie, no sanity checking at all).
Hard to prevent fat fingers -- some tools exist to emit well-formed files, and there is a 'preflighting' capability in nagios that might be added before applying a VCS commit.
> - nagios alerts us too much for things that don't matter or that we
> cannot do anything about. (isolated network issues, servers not responding in cases where there are no impacts, alerting then recovering very quickly)
That sounds like simple tuning of alert thresholds, and may be done iteratively as a given false positive becomes overly annoying.
> - our dependencies are not right, so we get 50 pages for an issue that
> is a single network or gateway link being down.
That is solved in the four corners of Nagios, which permits maintaining 'maps' of what is 'behind' a given node. Nagios does not look, nor complain, further down an impaired path, once configured.
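Concretely, the 'map' is the parents directive on host objects; a sketch with hypothetical names:

```
# Hypothetical sketch: web01-site1 sits behind gw-site1.  When the
# gateway is DOWN, nagios marks the hosts behind it UNREACHABLE
# instead of DOWN and (with default settings) does not page for them.
define host {
    host_name   gw-site1
    address     192.0.2.1
    use         generic-host
}

define host {
    host_name   web01-site1
    address     192.0.2.10
    parents     gw-site1
    use         generic-host
}
```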
> - nagios monitors remote machines over vpn links, but also over non vpn
> links, leading to us having confusion over what a machine is named in nagios.
Using FQ hostnames, and having a 'dummy' TLD of '.VPN', comes to mind.
> - Our current setup isn't flexible enough to allow monitoring of non
> core resources, but we want to do this. (for example, qa machines or secondary arch machines or cloud machines, where we may want different groups to get alerted or manage those machines).
Nagios has the concept of notification sub-groups for clutches of machines.
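In config terms that's a contactgroup per team plus contact_groups on their hosts; a sketch with made-up names:

```
# Hypothetical sketch: a separate contact group for qa, attached to
# the qa machines so only that group is notified about them.
define contactgroup {
    contactgroup_name   qa-admins
    alias               QA machine owners
    members             qauser1,qauser2
}

define host {
    host_name       qa01
    address         192.0.2.20
    use             generic-host
    contact_groups  qa-admins
}
```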
> - When we add new machines or services we often forget to add
> monitoring, or we don't monitor all the things we should be.
That is more a matter of a problem in the accession and de-cession checklists, no?
> - When we do work on machines we sometimes forget to silence nagios.
<hal> 'This kind of thing has happened before, and it has always been ... human error' </hal voice>
... snip here, mention of yet more technological tools ...
Nagios may or may not be the right solution, but it is not very loady and works better than at least 80 pct of the alternatives I've tried, I'd say -- we run (locally and externally) Zabbix, bugzilla, smokeping, cacti, OpenNMS, a local wiki from tarballs, a local mediawiki, local custom SNMP trapping, and custom DB-backed outage tracking and remediation -- and each has faults or things I wish they did differently. At least Nagios is reasonably well extensible.
I have a couple of bugs open on some of those tools in EPEL and they have not resulted in any apparent change, so it may simply be that the talent is too thin, and I do not have a solution for that.
Most of what you outlined as problems is either a usage or training issue, or a systematic sysadmin issue -- ladling puppet or cfengine, or ansible, or (name the new and shiny tool of the day) -- a technology solution -- won't solve people issues (not reading, not testing, forcing commits of broken setups, and room for design and doco improvements)
-- Russ herrold
On Thu, 7 Feb 2013 17:40:30 -0500 (EST) R P Herrold herrold@owlriver.com wrote:
> Hard to prevent fat fingers -- some tools exist to emit well-formed files, and there is a 'preflighting' capability in nagios that might be added before applying a VCS commit.
Yep. We just need to hook it up...
> > - nagios alerts us too much for things that don't matter or that we
> > cannot do anything about. (isolated network issues, servers not responding in cases where there are no impacts, alerting then recovering very quickly)
> That sounds like simple tuning of alert thresholds, and may be done iteratively as a given false positive becomes overly annoying.
Yes.
> > - our dependencies are not right, so we get 50 pages for an issue
> > that is a single network or gateway link being down.
> That is solved in the four corners of Nagios, which permits maintaining 'maps' of what is 'behind' a given node. Nagios does not look, nor complain, further down an impaired path, once configured.
Yep. This is complicated however by the following:
we use nrpe to monitor/show hosts alive. This nrpe connection is over our vpn.
foobar.vpn.fedoraproject.org could be down because:
a) The machine really is down.
b) The host it's running on is down, but the vpn doesn't have anything to monitor on that host, so you can't add a dep for 'foobars-host.fedoraproject.org' easily/transparently.
c) The network to that site is down.
I think this is fixable, but requires some careful rethinking.
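One possible shape for that rethinking (all names hypothetical): model the site link itself as a host, and hang the vpn names off it as parents, so a dead link collapses to a single alert instead of one per machine.

```
# Hypothetical sketch: site1-link stands in for the network path to
# the remote site.  If it goes down, everything behind it goes
# UNREACHABLE rather than paging host-by-host.
define host {
    host_name   site1-link
    address     192.0.2.1              ; site gateway / tunnel endpoint
    use         generic-host
}

define host {
    host_name   foobar.vpn.fedoraproject.org
    address     foobar.vpn.fedoraproject.org
    parents     site1-link
    use         generic-host
}
```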
> > - nagios monitors remote machines over vpn links, but also over non
> > vpn links, leading to us having confusion over what a machine is named in nagios.
> Using FQ hostnames, and having a 'dummy' TLD of '.VPN' comes to mind.
That doesn't help. See above. ;) We do have a vpn subdomain.
> > - Our current setup isn't flexible enough to allow monitoring of non
> > core resources, but we want to do this. (for example, qa machines or secondary arch machines or cloud machines, where we may want different groups to get alerted or manage those machines).
> Nagios has the concept of notification sub-groups for clutches of machines.
Yep. Again, it's a matter of setting it up and making it easy to manage.
> > - When we add new machines or services we often forget to add
> > monitoring, or we don't monitor all the things we should be.
> That is more a matter of a problem in the accession and de-cession checklists, no?
Yes, but it also should be automated as much as possible, as people sometimes miss things on checklists; if monitoring were automatically set up when a machine was added, there would be fewer errors.
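One way to get there (the task shape and the monitoring host name are assumptions, not our actual playbooks): have the same play that provisions a machine also template out its nagios config on the monitoring box and reload nagios.

```yaml
# Hypothetical ansible task: whenever a host is provisioned, drop a
# generated nagios host definition on the monitoring server so a new
# machine can't silently go unmonitored.
- name: install nagios config for this host
  template:
    src: nagios-host.cfg.j2
    dest: "/etc/nagios/conf.d/{{ inventory_hostname }}.cfg"
  delegate_to: noc01            # hypothetical monitoring host
  notify: reload nagios
```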
> > - When we do work on machines we sometimes forget to silence nagios.
> <hal> 'This kind of thing has happened before, and it has always been ... human error' </hal voice>
Indeed. :)
> ... snip here, mention of yet more technological tools ...
> Nagios may or may not be the right solution, but it is not very loady and works better than at least 80 pct of the alternatives I've tried, I'd say -- we run (locally and externally) Zabbix, bugzilla, smokeping, cacti, OpenNMS, a local wiki from tarballs, a local mediawiki, local custom SNMP trapping, and custom DB-backed outage tracking and remediation -- and each has faults or things I wish they did differently. At least Nagios is reasonably well extensible.
Yeah, we tried zabbix a while back and it never made it to production. Big issues: only manageable via web interface, couldn't handle the load very well, difficult to add new checks to a wide pile of hosts.
> I have a couple of bugs open on some of those tools in EPEL and they have not resulted in any apparent change, so it may simply be that the talent is too thin, and I do not have a solution for that.
Indeed. Could well be.
> Most of what you outlined as problems is either a usage or training issue, or a systematic sysadmin issue -- ladling puppet or cfengine, or ansible, or (name the new and shiny tool of the day) -- a technology solution -- won't solve people issues (not reading, not testing, forcing commits of broken setups, and room for design and doco improvements).
Yes, I didn't say this would be solved by magic tools. I just noted that we needed to revamp/clean up our config and do what we could to automate things at the same time.
Thanks for the input.
kevin