On Thu, 2008-06-26 at 17:16 -0400, Dan Williams wrote:
On Thu, 2008-06-26 at 15:41 -0500, Jason L Tibbitts III wrote:
> >>>>> "JB" == Josh Boyer <jwboyer(a)gmail.com> writes:
>
> JB> That might have had a bigger effect. I thought koji would only run
> JB> one build job per builder? Or is it per CPU?
>
> I don't know what koji does, but in this case koji was unaware that
> the jobs were still running. I guess they had been killed from the
> server but not cleaned up on the builders.
This happened a lot with plague too. I think it's Just Hard in *NIX to
ensure that all descendants of a given task have been killed dead dead
dead. Maybe they somehow get out of the parent's process group, they
are just hung and don't respond to signals, they are in D state when the
signals get sent, whatever. Running craploads of scripts and programs
as part of the build process that fork and exec and do God-knows-what
doesn't lend itself to being cleaned up easily.
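
For illustration only, here is a minimal Python sketch of the usual
process-group approach, assuming a POSIX system. The names
run_in_own_group and kill_group are hypothetical and are not taken from
koji or plague. It starts the build as the leader of a new session and
later signals the whole group. It also shows the limits described
above: a process that calls setsid() or setpgid() leaves the group and
is missed, and a process stuck in D state does not die until its
system call returns.

```python
import os
import signal
import subprocess


def run_in_own_group(cmd):
    """Start cmd as the leader of a new session and process group."""
    # start_new_session=True calls setsid() in the child, so the child's
    # process group ID equals its PID and its descendants inherit it.
    return subprocess.Popen(cmd, start_new_session=True)


def kill_group(proc, grace=1.0):
    """SIGTERM the whole group, wait briefly, then SIGKILL what is left.

    Only reaches processes still in the group; anything that moved to
    another group or session is not signalled.
    """
    pgid = proc.pid  # valid because proc is a session leader
    try:
        os.killpg(pgid, signal.SIGTERM)
    except ProcessLookupError:
        return  # group is already gone
    try:
        proc.wait(timeout=grace)  # give the leader time to exit cleanly
    except subprocess.TimeoutExpired:
        pass
    try:
        os.killpg(pgid, signal.SIGKILL)  # anything that ignored SIGTERM
    except ProcessLookupError:
        pass  # nothing left in the group
    proc.wait()  # reap the leader so it does not linger as a zombie
```

A builder could call kill_group(proc) when the server cancels a task,
but it would still need something stronger (cgroups, a container, a VM)
to catch the escapees.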
I think either cgroups (?) or putting each build in a clean VM which can
be torn down completely is probably the answer. And out of those two, a
whole new VM would be pretty heavy to create/destroy so it's probably
out of the question.
And impossible on ppc without LPAR support for every builder.
Hopefully containers will help when the builders move to RHEL6 (if RHEL6
supports containers...).
josh