monitoring rebuild

Thu Feb 7 22:16:24 UTC 2013

Greetings. 

I'd like to look at revamping our nagios setup to be nicer. 

Problems with our current setup: 

- It's very easy to make syntax errors that you can only fix with
  further commits (i.e., there is no sanity checking at all). 

- nagios alerts us too much for things that don't matter or that we
  cannot do anything about (isolated network issues, servers not
  responding in cases where there is no impact, alerts that fire and
  then recover very quickly). 

- our dependencies are not right, so we get 50 pages for an issue that
  is a single network or gateway link being down. 

- nagios monitors remote machines over vpn links, but also over non vpn
  links, which leads to confusion over what a machine is named in
  nagios. 

- Our current setup isn't flexible enough to allow monitoring of non
  core resources, but we want to do this (for example, qa machines
  or secondary arch machines or cloud machines, where we may want
  different groups to get alerted or manage those machines). 

- When we add new machines or services we often forget to add
  monitoring, or we don't monitor all the things we should be. 

- When we do work on machines we sometimes forget to silence nagios. 
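
As a rough illustration of the dependency problem above (one gateway
link down, 50 pages), here is a minimal sketch of the behaviour we
would want: only alert on hosts that have nothing down upstream of
them. This is not how nagios itself implements dependencies, and the
host names and the shape of the `parents` mapping are made up for the
example.

```python
def alerts_to_send(down, parents):
    """Return the down hosts that should still alert.

    down:    set of host names currently failing (hypothetical names).
    parents: dict mapping a host to the list of hosts it depends on.
    A host is suppressed if anything it depends on, directly or
    transitively, is also down.
    """
    def has_down_ancestor(host, seen):
        for parent in parents.get(host, []):
            if parent in seen:
                continue  # already visited; also guards against cycles
            seen.add(parent)
            if parent in down or has_down_ancestor(parent, seen):
                return True
        return False

    return sorted(h for h in down if not has_down_ancestor(h, set()))


# Hypothetical topology: two app servers behind one gateway,
# and a database host that depends on app01.
parents = {"app01": ["gw"], "app02": ["gw"], "db01": ["app01"]}
print(alerts_to_send({"gw", "app01", "app02", "db01"}, parents))
```

With the gateway down, only the gateway would page; everything behind
it stays quiet until the gateway is back.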

Requirements: 

- syntax check commits and reject them where they would break things. 
- only send pages for things that impact either end users or
  maintainers. Things that don't impact either of those should
  mail/notify, but not page. 
- correct dependencies, so that if one thing is down, everything that
  depends on it does not alert. 
- allow non FI groups to have machines/networks they manage. 
- When adding new machine/service, monitoring should be automatically
  configured if at all possible. 
- Possibly have some kind of escalations setup... page, wait X min for
  an ack, if not page again or page more people, etc. 
- When doing work have an easy way to silence alerts for affected
  machines. 
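
For the last requirement, one possible building block is the nagios
external command SCHEDULE_HOST_DOWNTIME. The sketch below only formats
the command line; writing it to the nagios command file (whose path
depends on the install) is left to the caller. The function name, its
parameters and the example host are assumptions for illustration.

```python
import time


def schedule_host_downtime(host, minutes, author, comment, now=None):
    """Format a SCHEDULE_HOST_DOWNTIME external command line.

    Fields: host;start;end;fixed;trigger_id;duration;author;comment
    The comment must not contain ';' since that is the field separator.
    """
    start = int(time.time()) if now is None else int(now)
    seconds = minutes * 60
    end = start + seconds
    # fixed=1: downtime runs exactly from start to end
    # trigger_id=0: not triggered by another downtime entry
    return "[{0}] SCHEDULE_HOST_DOWNTIME;{1};{2};{3};1;0;{4};{5};{6}".format(
        start, host, start, end, seconds, author, comment)


# Hypothetical usage: 30 minutes of downtime for a host named app01.
print(schedule_host_downtime("app01", 30, "admin", "planned reboot"))
```

Something like this would only work reliably if the host name passed
in matches the name nagios knows the machine by.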

Some possible parts of the solution: 

- check_mk - This would allow us to just install and query it, and it
  would monitor all the running processes, etc. Could very much
  simplify config on normal nodes. 

- ansible playbooks to easily disable notifications or alerts in cases
  where that's needed. We can use the web interface, but a simple
  script to do this would be nice to call from other playbooks. This
  needs consistent naming for hosts. 

- commit hook to check syntax. This will require nagios on lockbox01
  and installing to a tree or something, unless there's a way we can
  pull the syntax checking out of nagios. Does the python-nagios
  package do it? 

- Identify the items/hosts that should page, and assume the rest are
  email/notice only. The mass reboot SOP already has these in the
  Class A/B hosts listed there. I can easily generate a list. 
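
For the commit hook idea, a minimal sketch of the check itself,
assuming a nagios binary is available where the hook runs and that
`nagios -v <main config>` exits non-zero when the config is broken.
The function names and the config path are placeholders, and how the
hook gets a checked-out tree to verify is not covered here.

```python
import subprocess
import sys


def config_is_valid(check_cmd):
    """Run a verification command; return (ok, combined output)."""
    result = subprocess.run(
        check_cmd,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        universal_newlines=True,
    )
    return result.returncode == 0, result.stdout


def run_hook(main_cfg):
    """Return 0 to accept the commit, 1 to reject it."""
    # Assumption: "nagios" is on PATH and main_cfg points at the
    # nagios.cfg of the tree being committed.
    ok, output = config_is_valid(["nagios", "-v", main_cfg])
    if not ok:
        sys.stderr.write(output)
        sys.stderr.write("Rejecting commit: nagios config check failed\n")
        return 1
    return 0
```

A hook script would then end with something like
`sys.exit(run_hook("/path/to/tree/nagios.cfg"))`, with the path being
whatever the hook exported the commit to.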

I'm sure there are more things we can do here, and several people have
looked into this problem in the past. It's not a simple one, but I think
it's a good one to work on and fix. 

I'm not sure if it makes sense to make a new nagios git repo (like we
did for dns) or just move it over to ansible repo. I kind of think it
would be nicer just in ansible repo. 

These are just preliminary ideas on the scope and problem... I'll look
at starting on some more detailed work soon. 

kevin
