infrastructure August 2008

infrastructure@lists.fedoraproject.org

55 participants
53 discussions

by Karsten Wade

Do we want to turn this page static for the Alpha release?

https://fedoraproject.org/wiki/Releases/10/Alpha/ReleaseNotes

I put out the last call for changes and am making a few myself. Is there a good target for calling to static for now?

We'll want to make sure we can edit it. Perhaps as part of making it static we can tell people to post changes to the talk page, then we'll do irregular updates of the static content?

Thanks - Karsten
--
Karsten Wade, Sr. Developer Community Mgr.
Dev Fu : http://developer.redhatmagazine.com
Fedora : http://quaid.fedorapeople.org
gpg key : AD0E0C41

15 years, 8 months


serverbeach2 and serverbeach3 reboot

by Ricky Zhou

serverbeach2 (asterisk1, collab1, ns1) and serverbeach3 (asterisk2, collab2) mysteriously rebooted ~30 minutes ago. Could it have been a power problem of some sort? There were no useful entries in /var/log/messages or anything. If we don't find any leads about this tomorrow, could it be something worth asking ServerBeach about? Thanks, Ricky

15 years, 8 months


Server Monitoring - A replacement for Nagios?

by Nigel Jones

Okay, so while this was intended to be a primary discussion point for tomorrow's Infrastructure meeting, we had a little bit of discussion first in #fedora-admin, and then in #fedora-meeting, regarding Zabbix, a tool like Nagios that I began to set up for testing this week.

In summary, the discussion ended positively: we think it will do the job quite well, and we now really need to sit down and work out whether we want to try implementing it on a limited scale in parallel with Nagios (to act as a comparison).

The related part of the agenda for tomorrow will now be:

* Do we want to push this into a limited trial (say 10 key-ish machines in our infrastructure)?
* How long would such a trial last?
* What are we going to use as a metric for such a trial?
* Are there other concerns?

Personally, I'd like to see this as a step forward in revamping sysadmin-noc so we can reduce the workload on members in sysadmin-main.

Review the log below.

-- Nigel

10:43 < mmcgrath> G: so whats your take on how big the zabbix db will get? Should we put it on db1 or on its own box?
10:43 < mmcgrath> if its on its own (probably the same one zabbix is on) we're lowering points of failure, but we might have to re-spec noc1 and noc2.
10:43 < G> mmcgrath: I'm not sure
10:44 < G> this is where it's great to have people like wakko666 and jcollie who use it at DAYJOB
10:44 < mmcgrath> yeah.
10:44 < G> maybe we need a nocdb1
10:44 < mmcgrath> If it needs to be pretty quick once it gets full of stuff, we might just want to put it on db1.
10:44 < mmcgrath> if it stays light though, we'll probably just keep it localhost to noc1 and give noc1 more ram / disk space.
10:45 < wakko666> mmcgrath: the rate of growth for the zabbix DB directly depends on the poll rates for all of the checks
10:45 < mmcgrath> wakko666: if you don't mind my asking.... how many hosts do you have and how big is the db?
10:45 < G> yeah, from what I can tell also, zabbix does it's own housekeeping to try and consolidate some of the data
10:45 < mmcgrath> and how much stuff do you monitor? pretty default stuff? or more then the default.
10:46 < wakko666> we have 50 hosts in production, and another 100 hosts outside that across two zabbix nodes
10:47 < mmcgrath> wakko666: is the two zabbix nodes for high availability or was it because one zabbix node couldn't handle the traffic?
10:47 < wakko666> it's because they're in different locations
10:47 < mmcgrath> <nod>
10:47 < mmcgrath> Do you have a dedicated db? How big is the raw database?
10:48 < fchiulli_> mmcgrath: I'm assuming that part of the discussion will be whether to have more than one zabbix monitoring host.
10:48 < wakko666> mmcgrath: we've got a dedicated mysql db for each node. the production data is currently around 10-20 GB, the non-production node sits at around 40-50 GB
10:48 < wakko666> the key thing to note is that zabbix keeps data in two forms, with tunable knobs for each.
10:48 < dgilmore> wakko666: over what time period?
10:48 < mmcgrath> wakko666: you don't happen to have sar data for those hosts you could give to me would you? :)
10:49 < wakko666> dgilmore: we're at about 3-4 months right now
10:49 < mmcgrath> I suppose we can start out small and move it later... its not really that big of a risk.
10:49 < wakko666> mmcgrath: unfortunately, today was my last day there. i was "reorganized" out of a job. ;-)
10:49 < dgilmore> wakko666: so your anticipating up to 80gb for production a year?
10:49 < mmcgrath> wakko666: doah, well... hope all is well.
10:50 < wakko666> dgilmore: sort of. as i was saying, there are two knobs. poll data, and trend data.
10:50 < wakko666> typically, we keep all polled data for about 7 days worth, then only keep trend data after that
10:50 < ricky> mmcgrath: IT's in now :-)
10:50 < dgilmore> much like cacti does
10:50 < mmcgrath> ricky: hilarious.
10:50 < wakko666> mmcgrath: yeah, i'll probably be fine. though, i wouldn't mind finding a spot at RH. ;-)
10:51 < G> wakko666: wait a second, I thought if you setup multiple nodes they could share the same tasks?
10:51 < mmcgrath> and, correct me if I'm wrong, but zabbix doesn't store RRD right? the graphs come from the database?
10:51 < G> mmcgrath: correct from what I can tell
10:51 < wakko666> mmcgrath: correct. graphs are auto-generated, not RRD. so you can create new graphs and they're autopopulated with old data
10:52 < wakko666> G: yes, nodes share the data from the tasks. the zabbix-agent.conf and zabbix-server.conf help configure which node performs the polling
10:52 < mmcgrath> wakko666: were you using auto-recovery services?
10:52 < G> wakko666: k, so it's one big db and you just assign hosts to each node?
10:53 < wakko666> mmcgrath: auto-recovery? not sure what you mean. perhaps you mean auto-discovery?
10:53 < G> wakko666: remote commands :)
10:53 < mmcgrath> wakko666: like if httpd dies on an app server, have zabbix restart it.
10:53 < wakko666> G: can be, or you can set up a db per node, or db on some nodes and not others. it's pretty flexible
10:54 < wakko666> mmcgrath: ah ha! yeah, you can have zabbix execute commands on healthcheck failure
10:54 < wakko666> really, the big limitation of zabbix is a couple of things
10:54 < G> I'd like to see noc1/noc2 share the zabbix checks
10:54 < wakko666> currently, in 1.4, there's no repeated notifications. one notify is all you get.
10:54 < G> wakko666: yeah, I noticed that
10:54 < wakko666> (it's coming in 1.6, which is due in Sept)
10:54 < mmcgrath> G: yeah, I'm totally fine re-thinking how we have our noc's setup. The big things I want are:
10:55 < mmcgrath> paged alerts when a service is not available.
10:55 < mmcgrath> and email alerts when an individual service in a farm goes down.
10:55 < G> mmcgrath: yeah
10:55 < wakko666> mmcgrath: yup, no troubles doing those, and you'll likely get finer granularity than with nagios
10:55 < mmcgrath> that got kind of tricky in one nagios instance.
10:55 < G> yep, exactly
10:56 < mmcgrath> well, and even tricker in one nagios instance in PHX :)
10:56 < mmcgrath> wakko666: if there's some services that noc1 can't get to but noc2 can, can you tell zabbix to always check those with noc2?
10:56 < G> mmcgrath: the nice thing is, is that you can run the zabbix-server on more than one server, and the web interface on totally different servers
10:57 < wakko666> yeah... with multiple nodes, you define checks per node. so you'd configure a particular host on noc2's zabbix node.
10:57 < G> yeah, thats what we really want
10:57 < mmcgrath> yep.
10:57 < G> actually #fedora-meeting is free, shall we have an impromptu there?
10:57 < wakko666> works for me.
10:58 < mmcgrath> G: sure

-- Discussion moved to #fedora-meeting --

10:58 -!- G changed the topic of #fedora-meeting to: sysadmin-noc - System Monitoring Needs
10:58 < mmcgrath> W00t
10:58 < G> ricky: dgilmore: jcollie: you folks around?
10:58 < mmcgrath> G: so I want zabbix to monitor when new versions of my packages are around, build them, and push them via bodhi when new versions are out :)
10:58 * mmcgrath runs
10:59 < wakko666> lol
10:59 < G> mmcgrath: haha :)
10:59 < ricky> G: pongish
10:59 < G> okay, so if you open your hym books to http://publictest3.fedoraproject.org/zabbix/overview.php we have a basic-ish setup atm
11:00 < wakko666> looks like the basic Linux Server template...
11:00 < G> wakko666: yeah :)
11:00 < G> wakko666: except I started moving some of the specific checks like apache into other templates and started linking them
11:00 < dgilmore> G: not really
11:01 < wakko666> G: that works. one suggestion: copy the default graphs for Zabbix Server into the Linux Server template so you get some default graphing for each host
11:01 < mmcgrath> G: any luck getting ahold of fchuili?
11:02 < G> argh, I meant to ping him back before
11:02 < G> dgilmore: no problem :)
11:02 < G> wakko666: ricky: you have accounts there now, irc nick/test
11:02 < mmcgrath> I'll drop him an email
11:03 < G> wakko666: they were in default settings iirc
11:03 < G> oh maybe not
11:03 < mmcgrath> G: you don't happen to know if we can plug this in to FAS do you?
11:03 * dgilmore will note he tried zabbix aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaannd founnnnnnnnnnnnnnnnnnnnd it useleeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeess hard to configure and didnt work right
11:04 < G> wakko666: okay, done that now
11:04 * dgilmore wonders when ajax will gggget XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXx fied:)pretty please
11:04 < G> mmcgrath: I don't think so, kinda like cacti in a way
11:04 < mmcgrath> dgilmore: thats so funny, G set this up in a matter of hours and have no problems at all ;-)
11:05 < wakko666> i'm not sure about plugging the auth into FAS, but it's PHP so at the very least it should be hackable
11:05 < dgilmore> mmcgrath: monitoring localhost worked
11:05 < dgilmore> mmcgrath: but that was it
11:05 < G> http://publictest3.fedoraproject.org/zabbix/charts.php?period=86400&dec=0...
11:05 < ricky> Worst case, we put it behind basic auth.
11:05 < mmcgrath> dgilmore: time for another look :)
11:05 < G> I like the stuff like that
11:05 < mmcgrath> ricky: yeah, thats what I was thinking
11:05 < dgilmore> mmcgrath: it was about 3 or 4 months ago i think
11:05 < G> dgilmore: I got 4 hosts monitored in no time, only trouble was iptables on the pt machines :)
11:06 < wakko666> one note: stacked graphs occasionally don't render quite right. sometimes zabbix leaves white space between data sets
11:06 < G> wakko666: yeah, but it still shows the trend quite nicely
11:06 < wakko666> G: agreed.
11:07 < wakko666> for the web servers, setting up app-specific web checks is great, and fairly easy to do
11:07 < G> hmmm whats on PT7, it takes a bit of beating
11:08 < G> http://publictest3.fedoraproject.org/zabbix/charts.php?period=43200&dec=0...
11:08 < G> okay, so I think the first thing is:
11:08 < G> What are our requirements?
11:08 < ricky> pt7 looks fine to me
11:08 < G> I can easily add the following:
11:08 < ricky> Ah, it's back in the green now.
11:09 < G> -> Equal checking abilities to nagios (i.e. the type of checks)
11:09 < ricky> Could you walk us through the processes of adding a complex check?
11:09 < G> -> Ability to send out e-mails/pagers
11:09 < ricky> And also, is there any sort of equivalent screen to https://admin.fedoraproject.org/nagios/cgi-bin//status.cgi?host=all&servi... in nagios?
11:09 < mmcgrath> and what is the difference between a "web" check and just your normal check?
11:09 < mmcgrath> how do we write custom plugins?
11:09 < mmcgrath> why is the sky blue?
11:09 < G> -> Ability to customise stuff
11:10 < ricky> (As little information as possible - a view with *just* what problems are going on)
11:10 < G> ricky: yes
11:10 < wakko666> mmcgrath: by 'web check' i mean, a semi-intelligent check of a web-app, where you can set up a series of steps for it to check through such as "hit koji.fp.org, click packages, click builds, etc"
11:10 < G> http://publictest3.fedoraproject.org/zabbix/tr_status.php?onlytrue=true&n...
11:10 < dgilmore> can i just edit a nice easy to read config file to do things?
11:11 < G> dgilmore: and break it while you try to work out why it broke
11:11 < wakko666> dgilmore: all config is done through the zabbix web gui
11:11 < G> errr
11:11 < G> and break it and spend ages working out why you broke it
11:11 < wakko666> ricky: the zabbix equivalent to that nagios screen is the Monitoring -> Overview screen, though under Screens, you can set up a customized view as well.
11:12 < dgilmore> wakko666: to me thats really bad
11:12 < G> wakko666: the triggers = true page is like that too
11:12 < G> (the link I pasted just before)
11:12 * dgilmore personally doesnt like configuring though a web gui. maybe why zabbix did not work out for me
11:12 < wakko666> dgilmore: it's a different paradigm. i don't equate different to bad. not having config files doesn't strike me as a flaw.
11:13 < abadger1999> wakko666: makes it harder to manage it via puppet.
11:14 < wakko666> abadger1999: yes and no. there are config files for the polling server daemon and the client-side agent. at $dayjob, i push the agent configs via puppet
11:14 < abadger1999> <nod.
11:15 < ricky> I guess abadger1999 was referring to things like configuration for specific checks and things like that
11:15 < G> wakko666: custom checks are defined in the agent config right?
11:15 < wakko666> to me, the big thing with zabbix is that it's essential to back up the db, and export your configs on a regular basis. it's painful to spend hours setting zabbix up, and have your db get corrupted and have to do all that work all over again
11:16 < ricky> So the exciting question: What problems that we're seeing with nagios does zabbix solve?
11:16 < wakko666> G: custom checks can be one of two things. custom zabbix-agent checks, and zabbix server-side remote checks
11:16 < G> wakko666: oh thats extra nifyt
11:16 < ricky> One thing is combining cacti functionality - what else?
11:16 < G> *nifty
11:16 < G> ricky: distributed monitoring :)
11:17 < ricky> Can you elaborate a bit? :-)
11:17 < G> and has Brett pointed out before, complex checks
11:17 < wakko666> for me, zabbix does templates and rapid configuration of new hosts significantly better than nagios
11:17 < G> errr complex web checks
11:17 < G> yeah, the templating looks _REALLY_ good
11:18 < wakko666> zabbix also is more granular than both cacti and nagios. the default network traffic checks are done every 5 seconds
11:18 < mmcgrath> G: I take it it has similar workflow that nagios has? (not that we used it?)
11:18 < G> build a profile of the typical application server apply the template to all the app servers and your home free
11:18 < ricky> Do you have a link where I can see the templating coolness in action?
11:18 < mmcgrath> but outage happens, someone ack's it and starts working?
11:18 < G> mmcgrath: ack etc? yeah
11:18 < f13> darn, I have to leave, but I'm really interested in what platform wins out. Particularly interested in zenoss vs zabbix
11:18 < ricky> Because right now, I'm visualizing hostgroups in nagios
11:18 < wakko666> mmcgrath: yes. same basic workflow
11:19 < wakko666> f13: i vote zabbix over zenoss simply because zabbix doesn't use rpath
11:19 < ricky> f13: zenoss = zope :-(
11:19 < mmcgrath> wakko666: G: how hard is it to script outages?
11:19 < G> that'd be something brett would have to answer
11:20 < f13> wakko666: there is that.
11:20 < f13> ricky: good point.
11:20 < f13> zenoss had something going for it in that previous cacti/nagios stuff would work with it, or so was the claim
11:20 < wakko666> outages are the one thing about zabbix that is a bit unclear to me. i think the best analogue is to disable monitoring (a single drop-down box), or to acknowledge the alert
11:21 * ricky still hasn't figured out where he can see templates
11:21 < wakko666> being that zabbix doesn't do repeated alerts, you'll only get a single "down" page anyway...
11:21 < mmcgrath> wakko666: as in its difficult to schedule an outage ahead of time?
11:21 < G> ricky: http://publictest3.fedoraproject.org/zabbix/hosts.php?groupid=0&config=2
11:21 < wakko666> ricky: Configuration > Items or Triggers. there's a Template drop-down
11:21 < ricky> Aha
11:22 < wakko666> mmcgrath: yeah, basically. as far as i've seen, zabbix doesn't yet have the concept of scheduled outages. a service is either up or down, and not much beyond that
11:23 < wakko666> i suspect that may be on their todo list for the next version, though
11:23 < ricky> So where can I see the linkage between a template and the checks for that template?
11:23 < G> I don't think it's an exact issue
11:23 < G> ricky: Items
11:23 < jcollie> you could always shut down the zabbix server :)
11:24 < wakko666> ricky: the expression column will have the template name in it
11:24 < mmcgrath> I've only looked a little bit but... how well does service deps work?
11:24 < wakko666> ricky: err... not expression column... the name column.
11:24 < ricky> I think I got it
11:24 < wakko666> mmcgrath: dependencies are dead easy.
11:24 < G> mmcgrath: it'd appear you can add multiple dependences per trigger
11:25 < G> http://publictest3.fedoraproject.org/zabbix/triggers.php?form=update&trig...
11:25 < wakko666> if you check apache on host A, but that check goes through router B, you add a dependency on the apache check so that the check doesn't execute unless the checks for router B are passing.
11:28 < mmcgrath> So really
11:29 < mmcgrath> G: how about this... We give it a quick talk tomorrow at the meeting there. If there's no blockers or major opposition. We get it on noc1 and get to work?
11:29 < G> mmcgrath: so your happy with what I've done on pt3 so far?
11:30 < mmcgrath> Yeah so far. I'd like to see it monitoring a couple of things along side nagios, both sending notifications, and see how it does in production.
11:30 < mmcgrath> so not spending a ton of time on it, but monitoring a few critical bits that frequently have problems.
11:30 < G> in that case sure, except if we are putting into production, I guess we should grab Jeff's 0.4.6 update and put it in f-i until it appears in epel
11:31 < G> I'll be happy to lead that task
11:31 < G> wakko666: jcollie: you both in sysadmin-noc?
11:31 < mmcgrath> G: excellent.
11:31 < wakko666> G: applying now. :)
11:31 < G> I'll sponsor you :)
11:32 < wakko666> yay! :-)
11:32 < G> mmcgrath: I think we'll leave the internal authentication for now, I'll leave the main part readable by everyone, and add accounts for everyone in sysadmin-main/noc thats active
11:33 < G> wakko666: done
11:33 < mmcgrath> G: thats fine.
11:34 < G> okay, so adjourned until the inframeeting 2000UTC tomorrow :)
11:34 < ricky> How can I trigger a check?
11:34 < wakko666> ricky: turn off the service that it's checking. ;-)
11:34 < ricky> Oh.
11:35 < wakko666> you can also just flip the logic of the trigger.
11:35 -!- G changed the topic of #fedora-meeting to: Channel is used by various Fedora groups and committees for their regular meetings | Note that meetings often get logged | For questions about using Fedora please ask in #fedora | See http://fedoraproject.org/wiki/Communicate/FedoraMeetingChannel for meeting schedule
11:36 < G> I'll post a log to the infra-list soon so people can have a read before the main meeting
11:37 < mmcgrath> G: good ide
11:37 < mmcgrath> a
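
For anyone trying to answer the "how big will the zabbix db get?" question from the log: wakko666 says growth depends directly on the poll rates, and that there are two retention knobs (polled data kept for about 7 days, trend data kept after that). A rough way to turn that into arithmetic is sketched below in Python. This is only a back-of-the-envelope sketch: the function names, the 24 trend rows per item per day, and the bytes-per-row figures are all assumptions for illustration and are not taken from the log or from Zabbix itself.

```python
SECONDS_PER_DAY = 86_400


def estimate_rows(items, poll_interval_s, history_days, trend_days,
                  trend_points_per_day=24):
    """Return (history_rows, trend_rows) once retention has reached steady state.

    history: one row per item per poll, kept for history_days.
    trends:  trend_points_per_day aggregate rows per item, kept for trend_days.
    The default of 24 (hourly aggregates) is an assumption, not from the log.
    """
    if poll_interval_s <= 0:
        raise ValueError("poll_interval_s must be positive")
    polls_per_day = SECONDS_PER_DAY // poll_interval_s
    history_rows = items * polls_per_day * history_days
    trend_rows = items * trend_points_per_day * trend_days
    return history_rows, trend_rows


def estimate_bytes(history_rows, trend_rows,
                   history_row_bytes=50, trend_row_bytes=100):
    """Convert row counts to bytes using assumed (hypothetical) row sizes."""
    return history_rows * history_row_bytes + trend_rows * trend_row_bytes


if __name__ == "__main__":
    # Hypothetical inputs: 1000 monitored items polled once a minute,
    # 7 days of polled data, one year of trend data.
    hist, trend = estimate_rows(items=1000, poll_interval_s=60,
                                history_days=7, trend_days=365)
    total_mb = estimate_bytes(hist, trend) / 1_000_000
    print(f"history rows: {hist}, trend rows: {trend}, ~{total_mb:.0f} MB")
```

The point of the sketch is the shape of the formula: the history term scales with the poll rate, so halving the poll interval doubles it, while the trend term depends only on the number of items and how long trends are kept. Real row sizes and index overhead would need to be measured on the trial instance.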

15 years, 8 months

