Hi folks, some thoughts have been slowly coalescing in my head about how we're managing our boxes/services and I have some suggestions I've passed by various folks but I wanted to check them out with everyone:
1. puppetd sucks..... memory. Right now we have puppetd running on every box and it wakes up every half hour and runs itself. This is fine but in the time where it is not doing anything it just eats memory for no good reason. I'd like to suggest we move to a cron-driven model instead of puppetd. I'd write a simple cron job that runs every half hour to run puppetd, if a lock file is not found. Pretty straightforward, of course.
2. monitoring if puppetd has run properly: two things we want to know about puppet runs:
a. when they last happened per-box
b. if they fell over in a horrible way.
(a) can be known by looking at the $nodename.yaml file which lives on the puppetmaster. I've written a script to check if that file is older than 1 hour and report the nodename if it is.
(b) can be done via the cron job - ie: taking error output from the puppet run and mailing to people until we fix it! :)
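A minimal sketch of the kind of check described in (a), in Python. This is not the actual script, only an illustration of the mtime comparison. Assumptions that are not from the thread: the function name `stale_nodes`, and that the puppetmaster keeps one `<nodename>.yaml` per node in a single directory whose path is passed in.

```python
import os
import time


def stale_nodes(yaml_dir, max_age_seconds=3600, now=None):
    """Return sorted node names whose <nodename>.yaml is older than max_age_seconds.

    yaml_dir is an assumption: the directory on the puppetmaster that holds
    one <nodename>.yaml file per node.
    """
    now = time.time() if now is None else now
    suffix = ".yaml"
    stale = []
    for entry in os.listdir(yaml_dir):
        if not entry.endswith(suffix):
            continue  # ignore anything that is not a node file
        path = os.path.join(yaml_dir, entry)
        age = now - os.path.getmtime(path)  # seconds since the file last changed
        if age > max_age_seconds:
            stale.append(entry[:-len(suffix)])  # strip ".yaml" to get the node name
    return sorted(stale)
```

Printing the result of `stale_nodes(...)` from a cron job would be enough to get the list of late nodes mailed out, since cron mails whatever a job prints.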
3. sign** boxes. problems here:
a. These boxes are falling out of date, repeatedly, b/c they aren't in our normal updating path.
b. these boxes don't email out to the same locations as the other boxes
c. these boxes don't get faspassword updates properly
d. these boxes don't get config changes normally via puppet
(a) I'd like to suggest that they be put into a normal updating path and/or we set up a nag mail to tell us about them
(b) obviously, fix their mail configs
(c) fasclient is failing b/c of a missing token b/c, most likely, of (d)
I'm open to suggestions on those but it is a bit annoying b/c while I understand their 'sensitivity' I think our way of treating them is making the problem WORSE not better.
-sv
On Fri, Mar 18, 2011 at 9:04 AM, seth vidal skvidal@fedoraproject.org wrote:
Hi folks, some thoughts have been slowly coalescing in my head about how we're managing our boxes/services and I have some suggestions I've passed by various folks but I wanted to check them out with everyone:
- puppetd sucks..... memory. Right now we have puppetd running on every
box and it wakes up every half hour and runs itself. This is fine but in the time where it is not doing anything it just eats memory for no good reason. I'd like to suggest we move to a cron-driven model instead of puppetd. I'd write a simple cron job that runs every half hour to run puppetd, if a lock file is not found. Pretty straightforward, of course.
I'd be happy to help get this going. I've set up puppet a few times in this fashion now and it's pretty easy to do.
- monitoring if puppetd has run properly:
two things we want to know about puppet runs: a. when they last happened per-box b. if they fell over in a horrible way.
It might be overkill, but puppet dashboard is pretty nice. It's a web interface, kind of like a nagios for puppet, telling you exactly the things you want to know above. Plus, it has some pretty graphs :) It runs on cron jobs too. I've set it up once about a year ago, pretty nice. I'm sure it's improved some since then. Again, I'd be happy to help set this up.
(a) can be known by looking at the $nodename.yaml file which lives on the puppetmaster. I've written a script to check if that file is older than 1 hour and report the nodename if it is. (b) can be done via the cron job - ie: taking error output from the puppet run and mailing to people until we fix it! :)
- sign** boxes. problems here:
a. These boxes are falling out of date, repeatedly, b/c they aren't in our normal updating path. b. these boxes don't email out to the same locations as the other boxes c. these boxes don't get faspassword updates properly d. these boxes don't get config changes normally via puppet
(a) I'd like to suggest that they be put into a normal updating path and/or we setup a nag mail to tell us about them (b) obviously, fix their mail configs (c) fasclient is failing b/c of a missing token b/c, most likely, of (d)
I'm open to suggestions on those but it is a bit annoying b/c while I understand their 'sensitivity' I think our way of treating them is making the problem WORSE not better.
-sv
On Fri, 2011-03-18 at 09:16 -0600, Clint Savage wrote:
On Fri, Mar 18, 2011 at 9:04 AM, seth vidal skvidal@fedoraproject.org wrote:
Hi folks, some thoughts have been slowly coalescing in my head about how we're managing our boxes/services and I have some suggestions I've passed by various folks but I wanted to check them out with everyone:
- puppetd sucks..... memory. Right now we have puppetd running on every
box and it wakes up every half hour and runs itself. This is fine but in the time where it is not doing anything it just eats memory for no good reason. I'd like to suggest we move to a cron-driven model instead of puppetd. I'd write a simple cron job that runs every half hour to run puppetd, if a lock file is not found. Pretty straightforward, of course.
I'd be happy to help get this going. I've set up puppet a few times in this fashion now and it's pretty easy to do.
I have a script now ($RANDOM is my friend) - I was just going to drop it into a path and put a crontab into /etc/cron.d/
is there more to it than that?
- monitoring if puppetd has run properly:
two things we want to know about puppet runs: a. when they last happened per-box b. if they fell over in a horrible way.
It might be overkill, but puppet dashboard is pretty nice. It's a web interface, kind of like a nagios for puppet, telling you exactly the things you want to know above. Plus, it has some pretty graphs :) It runs on cron jobs too. I've set it up once about a year ago, pretty nice. I'm sure it's improved some since then. Again, I'd be happy to help set this up.
feels like overkill. I'm not partial to web interfaces, personally, b/c to get them running and accessible on puppet.fp.o requires a fair bit of hoop jumping and reverse proxying.
-sv
On Fri, Mar 18, 2011 at 9:19 AM, seth vidal skvidal@fedoraproject.org wrote:
On Fri, 2011-03-18 at 09:16 -0600, Clint Savage wrote:
On Fri, Mar 18, 2011 at 9:04 AM, seth vidal skvidal@fedoraproject.org wrote:
Hi folks, some thoughts have been slowly coalescing in my head about how we're managing our boxes/services and I have some suggestions I've passed by various folks but I wanted to check them out with everyone:
- puppetd sucks..... memory. Right now we have puppetd running on every
box and it wakes up every half hour and runs itself. This is fine but in the time where it is not doing anything it just eats memory for no good reason. I'd like to suggest we move to a cron-driven model instead of puppetd. I'd write a simple cron job that runs every half hour to run puppetd, if a lock file is not found. Pretty straightforward, of course.
I'd be happy to help get this going. I've set up puppet a few times in this fashion now and it's pretty easy to do.
I have a script now ($RANDOM is my friend) - I was just going to drop it into a path and put a crontab into /etc/cron.d/
is there more to it than that?
Well, you mentioned that it was puppetd, but I have never run puppetd standalone since you can in fact run puppet 'filename' and it will progress through the same way. Though I do recall you have to have the files locally, so maybe that's why you are using puppetd in this way instead?
You are right though, it's not hard. I was just trying to keep my chops up with puppet and thought you guys could use a hand here so you can move on to cooler, more important problems :)
- monitoring if puppetd has run properly:
two things we want to know about puppet runs: a. when they last happened per-box b. if they fell over in a horrible way.
It might be overkill, but puppet dashboard is pretty nice. It's a web interface, kind of like a nagios for puppet, telling you exactly the things you want to know above. Plus, it has some pretty graphs :) It runs on cron jobs too. I've set it up once about a year ago, pretty nice. I'm sure it's improved some since then. Again, I'd be happy to help set this up.
feels like overkill. I'm not partial to web interfaces, personally, b/c to get them running and accessible on puppet.fp.o requires a fair bit of hoop jumping and reverse proxying.
-sv
I debated even suggesting it. I kind of thought it might be a pain in that fashion, but I know that it generates some interesting detail. If I remember correctly, it just grabs the .yaml file so maybe that's good enough to do with the script you mention above. I'll put some other brainwaves into it today, but if nothing comes of it, I'd be happy to help here too.
Clint
Clint Savage wrote:
It might be overkill, but puppet dashboard is pretty nice. It's a web interface, kind of like a nagios for puppet, telling you exactly the things you want to know above. Plus, it has some pretty graphs :) It runs on cron jobs too. I've set it up once about a year ago, pretty nice. I'm sure it's improved some since then. Again, I'd be happy to help set this up.
That's going to run afoul of infrastructure's desire to only run tools that are acceptable for packaging in Fedora/EPEL AFAIK. As with almost any other rails application, puppet-dashboard bundles a number of ruby gems that make it unacceptable -- last I looked anyway.
seth vidal wrote:
- puppetd sucks..... memory.
Indeed it does. This seems to be worse on the older ruby that is shipped in RHEL-5 than in Fedora, but it's still a non-trivial amount of RAM. A lot of folks run puppet from cron for this reason.
I'd like to suggest we move to a cron-driven model instead of puppetd. I'd write a simple cron job that runs every half hour to run puppetd, if a lock file is not found. Pretty straightforward, of course.
You can probably avoid mucking with lockfiles yourself and just call puppetd. It will use /var/lib/puppet/state/puppetdlock by default. This is the same file that puppetd --disable writes. So if puppet is disabled or already running, you'll just get the message that puppet is skipping this run.
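A rough sketch of what a cron-driven wrapper could look like if it leaves the locking to puppetd as described above. It is written in Python for illustration and is not the script mentioned elsewhere in this thread. Assumptions: the `puppetd --onetime --no-daemonize` invocation, the 300-second maximum splay, the function names, and that a non-zero exit code is what signals a run worth mailing about.

```python
import random
import subprocess
import sys
import time

# Assumed invocation: one foreground run, then exit.
PUPPET_CMD = ["puppetd", "--onetime", "--no-daemonize"]


def run_once(cmd=PUPPET_CMD, max_splay=0, sleep=time.sleep):
    """Wait a random splay, run cmd once, and return (exit_code, combined_output)."""
    if max_splay > 0:
        # Spread the clients out so they do not all hit the puppetmaster at once.
        sleep(random.uniform(0, max_splay))
    proc = subprocess.run(
        cmd,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # fold stderr into stdout so nothing is lost
        universal_newlines=True,
    )
    return proc.returncode, proc.stdout


def main():
    code, output = run_once(max_splay=300)
    if code != 0:
        # cron mails anything a job writes, so stay silent unless the run failed.
        sys.stderr.write(output)
    return code
```

Installed under a hypothetical path such as /usr/local/bin/puppet-cron-run and called every half hour from a file in /etc/cron.d/, this stays quiet on clean runs and produces mail only when the run exits non-zero.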
(I'm preparing puppet-2.6.6 for fedora and epel -testing repos now, coincidentally.)
On Fri, 18 Mar 2011 11:04:32 -0400 seth vidal skvidal@fedoraproject.org wrote:
Hi folks, some thoughts have been slowly coalescing in my head about how we're managing our boxes/services and I have some suggestions I've passed by various folks but I wanted to check them out with everyone:
- puppetd sucks..... memory. Right now we have puppetd running on
every box and it wakes up every half hour and runs itself. This is fine but in the time where it is not doing anything it just eats memory for no good reason. I'd like to suggest we move to a cron-driven model instead of puppetd. I'd write a simple cron job that runs every half hour to run puppetd, if a lock file is not found. Pretty straightforward, of course.
I think this is a fine idea. ;)
monitoring if puppetd has run properly: two things we want to know about puppet runs:
a. when they last happened per-box
b. if they fell over in a horrible way.
(a) can be known by looking at the $nodename.yaml file which lives on the puppetmaster. I've written a script to check if that file is older than 1 hour and report the nodename if it is.
(b) can be done via the cron job - ie: taking error output from the puppet run and mailing to people until we fix it! :)
Sounds good. There are a few boxes where we don't run puppet (the sign* boxes, some of the backup boxes?).
Options here:
1) if we don't intend to puppet manage them, perhaps we should completely disable them/comment them out for normal operations? I know the sign* machines' puppet module is intended to set up everything needed on those machines with a blank db and ready to configure. So, we would only be using this when setting up new instances, and disable it the rest of the time.
2) Fix the puppet modules on them so that for normal operations they only do a small number of things... fasClient, etc. I think this is not intended, however, for security reasons.
- sign** boxes. problems here:
a. These boxes are falling out of date, repeatedly, b/c they aren't in our normal updating path.
b. these boxes don't email out to the same locations as the other boxes
c. these boxes don't get faspassword updates properly
d. these boxes don't get config changes normally via puppet
(a) I'd like to suggest that they be put into a normal updating path and/or we setup a nag mail to tell us about them
(b) obviously, fix their mail configs
(c) fasclient is failing b/c of a missing token b/c, most likely, of (d)
I'm open to suggestions on those but it is a bit annoying b/c while I understand their 'sensitivity' I think our way of treating them is making the problem WORSE not better.
a) I'd agree. nag mail on updates might be the easy path.
b) yep
c) Perhaps we should just make them non-fas accounts there? Like backup?
d) we either need to fix the puppet module to not tamper with any db stuff in normal operations, or not use puppet on them except to set up the initial config.
I know one of the things I was going to look at doing was making a new sign-{bridge|vault} pair with puppet to see what all it did and whether it got everything set up, etc.
So, short term, I would say we should apply updates, fix mail, set up nag mail for updates, and fix fasclient, and leave the puppet issue for later, after we look at what all is going on in that module.
kevin
On Fri, Mar 18, 2011 at 11:04:32AM -0400, seth vidal wrote:
Hi folks, some thoughts have been slowly coalescing in my head about how we're managing our boxes/services and I have some suggestions I've passed by various folks but I wanted to check them out with everyone:
- puppetd sucks..... memory. Right now we have puppetd running on every
box and it wakes up every half hour and runs itself. This is fine but in the time where it is not doing anything it just eats memory for no good reason. I'd like to suggest we move to a cron-driven model instead of puppetd. I'd write a simple cron job that runs every half hour to run puppetd, if a lock file is not found. Pretty straightforward, of course.
+1
Might need to update kickstarts and/or the SOP pages:
http://fedoraproject.org/wiki/Kickstart_Infrastructure_SOP
http://fedoraproject.org/wiki/Puppet_Infrastructure_SOP
monitoring if puppetd has run properly: two things we want to know about puppet runs: a. when they last happened per-box b. if they fell over in a horrible way.
(a) can be known by looking at the $nodename.yaml file which lives
on the puppetmaster. I've written a script to check if that file is older than 1 hour and report the nodename if it is. (b) can be done via the cron job - ie: taking error output from the puppet run and mailing to people until we fix it! :)
+1
- sign** boxes. problems here: a. These boxes are falling out of date, repeatedly, b/c they aren't
in our normal updating path. b. these boxes don't email out to the same locations as the other boxes c. these boxes don't get faspassword updates properly d. these boxes don't get config changes normally via puppet
(a) I'd like to suggest that they be put into a normal updating path and/or we setup a nag mail to tell us about them (b) obviously, fix their mail configs (c) fasclient is failing b/c of a missing token b/c, most likely, of (d)
I'm open to suggestions on those but it is a bit annoying b/c while I understand their 'sensitivity' I think our way of treating them is making the problem WORSE not better.
I agree with your assessment. I guess we need to tell releng our concerns and figure out what needs to be done.
For (a): perhaps have releng okay us, or a specific subset of sysadmins, to run updates along with all the other updates.
-Toshio
Hello Folks,
I wrote a monitoring script[1] for check_mk[2], which is capable of integrating "puppetstatus" and a last-run check, and shows whether puppet is disabled (also the reason for disabling it) or when the agent was last run.
It cannot check what puppet did or whether the catalog compiled successfully; I'd +1 puppet-dashboard for that.
Speaking of enhancing things, I'd also like to suggest switching local-host checks from NRPE to check_mk via SSH.
There are certain benefits:
• Automatic inventory of checks per host
• For each host, Nagios triggers only one active check, which returns all the data needed to feed the other, passive checks.
• All other checks just use cached data from the active check.
• check_mk also extracts performance data and can insert it directly into round robin databases.
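To illustrate the "one active check feeds many passive checks" idea, here is a small Python sketch. It is not check_mk's code. It assumes agent output split into sections by `<<<name>>>` header lines; the `puppet_lastrun` section, the thresholds, and all function names are made up for the example.

```python
def parse_agent_output(text):
    """Split one agent dump into {section_name: [lines]} using <<<name>>> headers."""
    sections = {}
    current = None
    for line in text.splitlines():
        if line.startswith("<<<") and line.endswith(">>>"):
            current = line[3:-3]
            sections[current] = []
        elif current is not None:
            sections[current].append(line)
    return sections


def check_uptime(sections, min_seconds=600):
    """Passive check computed from cached data: warn if the box rebooted recently."""
    seconds = float(sections["uptime"][0].split()[0])
    if seconds >= min_seconds:
        return (0, "up %d s" % seconds)
    return (1, "rebooted %d s ago" % seconds)


def check_puppet_age(sections, now, max_age=3600):
    """Passive check from a hypothetical section holding the last-run epoch time."""
    last_run = float(sections["puppet_lastrun"][0])
    age = now - last_run
    if age <= max_age:
        return (0, "last run %d s ago" % age)
    return (2, "no run for %d s" % age)
```

The point of the design is that the host is contacted once; every individual check then works on the parsed dictionary rather than opening its own connection.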
Regards, Stefan.
[1] - https://github.com/sts/checkmk/tree/master/puppet
[2] - http://mathias-kettner.de/check_mk.html
-- Stefan Schlesinger ////////////////////////////////////////// /////// sts@ono.at
On Mon, 21 Mar 2011 11:42:58 +0100 Stefan Schlesinger sts@ono.at wrote:
Hello Folks,
I wrote a monitoring script[1] for check_mk[2], which is capable of integrating "puppetstatus" and a last-run check, and shows whether puppet is disabled (also the reason for disabling it) or when the agent was last run.
Cool.
check_mk looks interesting... but it seems it's not in Fedora/EPEL yet?
It cannot check for what puppet did and whether the catalog compiled successfully, I'd +1 for puppet-dashboard here.
As noted, I don't think puppet-dashboard is going to happen.
Speaking of enhancing things, I'd as well like to suggest switching local-host checks from NRPE to check_mk via SSH.
There are certain benefits:
• Automatic inventory of checks per host • For each host, Nagios only triggers only active check any more, which returns all data needed to feed the other, passive checks. • All other checks just use cached data from the active check. • check_mk also extracts performance data and can directly insert that into round robin databases.
Yeah, looks interesting. Someone should get it into EPEL and we can evaluate it. ;)
Regards, Stefan.
kevin
On Fri, 2011-03-18 at 09:28 -0700, Toshio Kuratomi wrote:
+1
Might need to update kickstarts and/or the SOP pages:
http://fedoraproject.org/wiki/Kickstart_Infrastructure_SOP http://fedoraproject.org/wiki/Puppet_Infrastructure_SOP
good point. done.
-sv
On Fri, 2011-03-18 at 11:04 -0400, seth vidal wrote:
Hi folks, some thoughts have been slowly coalescing in my head about how we're managing our boxes/services and I have some suggestions I've passed by various folks but I wanted to check them out with everyone:
- puppetd sucks..... memory. Right now we have puppetd running on every
box and it wakes up every half hour and runs itself. This is fine but in the time where it is not doing anything it just eats memory for no good reason. I'd like to suggest we move to a cron-driven model instead of puppetd. I'd write a simple cron job that runs every half hour to run puppetd, if a lock file is not found. Pretty straightforward, of course.
this is done.
monitoring if puppetd has run properly: two things we want to know about puppet runs: a. when they last happened per-box b. if they fell over in a horrible way.
(a) can be known by looking at the $nodename.yaml file which lives
on the puppetmaster. I've written a script to check if that file is older than 1 hour and report the nodename if it is. (b) can be done via the cron job - ie: taking error output from the puppet run and mailing to people until we fix it! :)
I've written this and it can now submit issues via nsca (via func). One problem: it appears our puppet node names do not match our nagios host names, A LOT. So we'll need to get some aliases in place so they work.
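One way the alias lookup could work, sketched in Python. This is not the script described above. The function names, the `puppet-run` service name, and the fallback of trying the short hostname are all assumptions; the tab-separated line is the host, service, return code, output layout that send_nsca reads for passive service results.

```python
def nagios_host_for(node, nagios_hosts, aliases):
    """Map a puppet node name to a nagios host name, or None if there is no match."""
    if node in aliases:
        return aliases[node]  # an explicit alias always wins
    if node in nagios_hosts:
        return node  # names already agree
    short = node.split(".", 1)[0]
    if short in nagios_hosts:
        return short  # nagios knows the box by its short name
    return None


def nsca_line(host, service, code, message):
    """Format one passive service result: host, service, return code, output."""
    return "%s\t%s\t%d\t%s\n" % (host, service, code, message)


def stale_results(stale_nodes, nagios_hosts, aliases, service="puppet-run"):
    """Return (lines_to_submit, unmapped_nodes) for nodes whose puppet run is late."""
    lines, unmapped = [], []
    for node in stale_nodes:
        host = nagios_host_for(node, nagios_hosts, aliases)
        if host is None:
            unmapped.append(node)  # needs an alias before it can be reported
        else:
            lines.append(nsca_line(host, service, 2, "no puppet run in the last hour"))
    return lines, unmapped
```

Returning the unmapped nodes separately gives a ready-made list of the aliases that still need to be added.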
-sv
infrastructure@lists.fedoraproject.org