On Thu, 2008-06-26 at 15:41 -0500, Jason L Tibbitts III wrote:
>>>>> "JB" == Josh Boyer <jwboyer(a)gmail.com> writes:
JB> That might have had a bigger effect. I thought koji would only run
JB> one build job per builder? Or is it per CPU?
I don't know what koji does, but in this case koji was unaware that
the jobs were still running. I guess they had been killed from the
server but not cleaned up on the builders.
This happened a lot with plague too. I think it's Just Hard in *NIX to
ensure that all descendants of a given task have been killed dead dead
dead. Maybe they somehow get out of the parent's process group, or they
are just hung and don't respond to signals, or they are in D state when the
signals get sent, whatever. Running craploads of scripts and programs
as part of the build process that fork and exec and do God-knows-what
doesn't lend itself to being cleaned up easily.
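To make the process-group point concrete, here is a minimal Python sketch of
the usual best-effort approach: start the job as the leader of its own
session and process group, then signal the whole group. This is only an
illustration, not what koji or plague actually do; the helper names
run_in_own_group and kill_group are made up for the example.

```python
import os
import signal
import subprocess
import time


def run_in_own_group(cmd):
    """Start cmd as the leader of a new session and process group.

    Hypothetical helper: start_new_session=True makes the child call
    setsid() before exec, so its process group id equals its pid.
    """
    return subprocess.Popen(cmd, start_new_session=True)


def kill_group(proc, grace=2.0):
    """Send SIGTERM to the whole group, then SIGKILL after a grace period."""
    pgid = proc.pid  # the child is a group leader, so pgid == pid
    try:
        os.killpg(pgid, signal.SIGTERM)
    except ProcessLookupError:
        # No process is left in the group; just reap the leader.
        proc.wait()
        return

    # Give the leader a chance to exit cleanly.
    deadline = time.monotonic() + grace
    while time.monotonic() < deadline and proc.poll() is None:
        time.sleep(0.05)

    # Anything still in the group (including children that outlived
    # the leader) gets SIGKILL.
    try:
        os.killpg(pgid, signal.SIGKILL)
    except ProcessLookupError:
        pass
    proc.wait()


if __name__ == "__main__":
    # A stand-in "build" that forks a background child and then blocks.
    build = run_in_own_group(["sh", "-c", "sleep 60 & sleep 60"])
    time.sleep(0.2)
    kill_group(build)
    print("leader exit status:", build.returncode)
```

Even this has the holes described above: a descendant that calls setsid()
or setpgid() itself leaves the group and never sees the signal, and a
process stuck in D state will not die until it returns from the kernel.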
I think either cgroups (?) or putting each build in a clean VM which can
be torn down completely is probably the answer. And out of those two, a
whole new VM would be pretty heavy to create/destroy so it's probably
out of the question.
Dan