Mon, Apr 25, 2016 at 03:28:40PM CEST, olichtne(a)redhat.com wrote:
On Fri, Apr 22, 2016 at 03:07:57PM +0200, Jan Tluka wrote:
> While working on graceful kill I noticed that when we call interrupt on
> a command, the forked child is left orphaned, so the pid_exists() check
> always returns true and the graceful kill times out.
>
> Adding a join() call after sending the interrupt solves this problem.
>
> Signed-off-by: Jan Tluka <jtluka(a)redhat.com>
> ---
> lnst/Common/NetTestCommand.py | 1 +
> 1 file changed, 1 insertion(+)
>
> diff --git a/lnst/Common/NetTestCommand.py b/lnst/Common/NetTestCommand.py
> index f09e402..6573ffd 100644
> --- a/lnst/Common/NetTestCommand.py
> +++ b/lnst/Common/NetTestCommand.py
> @@ -212,6 +212,7 @@ class NetTestCommand:
> else:
> logging.debug("Interrupting command with id \"%s\", pid \"%d\"" % (self._id, self._pid))
> os.killpg(os.getpgid(self._pid), signal.SIGINT)
> + self._process.join()
AH! I remember this one!... I spent 3 weeks looking for this deadlock
bug. Take a look at this commit:
e2dcc4d126b8dee82df9a6f21e66b75905f224c4
This needs to be checked before applying this patch... The commit
message mentions large amounts of data being sent over the communication
PIPE. I'm not sure exactly how much data that is at the moment, but
something tells me it could be connected to the memory page size of the system.
Also I think we saw this bug when using tcpdump, probably from the
PacketAssert module.
-Ondrej
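
A minimal sketch of the deadlock pattern described above, and one way to
avoid it. It is not taken from LNST or from the commit referenced above:
the names worker, run_and_collect and PAYLOAD_SIZE are hypothetical, and
the 1 MiB payload is only assumed to be larger than the kernel's pipe
buffer. The child blocks in send() until the parent reads, so calling
join() before recv() would hang. The sketch therefore reads first and
joins afterwards. The "fork" start method is assumed, so this is
POSIX-only.

```python
import multiprocessing as mp

# Assumption: 1 MiB is larger than the OS pipe buffer, so the child
# cannot finish writing until the parent starts reading.
PAYLOAD_SIZE = 1 << 20


def worker(conn):
    # Child side: send one large result over the pipe, then exit.
    conn.send(b"x" * PAYLOAD_SIZE)
    conn.close()


def run_and_collect():
    ctx = mp.get_context("fork")
    # duplex=False: first end is receive-only, second end is send-only.
    parent_end, child_end = ctx.Pipe(duplex=False)
    proc = ctx.Process(target=worker, args=(child_end,))
    proc.start()
    child_end.close()  # the parent keeps only the receiving end

    # Drain the pipe BEFORE joining. Swapping these two calls would
    # leave the child blocked in send() and the parent blocked in join().
    data = parent_end.recv()
    proc.join()
    return len(data), proc.exitcode
```

With this ordering the child can always finish its write, so join()
returns as soon as the child exits.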
Yes, the deadlock is there. Scratch this patch set until we find a better
solution.
Thanks for catching this, Ondrej!
-Jan
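
For reference, the symptom from the original patch description can be
reproduced in isolation. The sketch below assumes the check probes the
pid with signal 0. The pid_exists() shown here is a hypothetical
stand-in, not LNST's implementation, and SIGKILL is used instead of
SIGINT only to keep the child trivial. A child that has exited but has
not been waited for still occupies its pid, so the probe keeps
succeeding until the parent reaps it, which is what join() does for a
forked multiprocessing.Process.

```python
import os
import signal
import time


def pid_exists(pid):
    """Hypothetical stand-in for a pid check: probe with signal 0."""
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # the pid exists but belongs to another user
    return True


def demo():
    pid = os.fork()
    if pid == 0:
        # Child: idle until killed; never return into the caller's code.
        try:
            time.sleep(60)
        finally:
            os._exit(0)

    os.kill(pid, signal.SIGKILL)
    # Wait until the child is dead WITHOUT reaping it: WNOWAIT leaves
    # the terminated child in its waitable (zombie) state.
    os.waitid(os.P_PID, pid, os.WEXITED | os.WNOWAIT)

    before = pid_exists(pid)  # unreaped child still holds the pid
    os.waitpid(pid, 0)        # reap the child
    after = pid_exists(pid)
    return before, after
```

On Linux, demo() is expected to return (True, False): the check only
starts reporting the process as gone once the parent has reaped it.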
>> self._control_cmd = cmd
>>
>> def kill(self, cmd):
>> --
>> 2.4.11