This patch set enhances handling of command timeouts. We've noticed that for example netperf ran from Netperf test module might not be able to make connection to netperf server. Linux default is to try send 6 SYN packets before giving up TCP connection. This can take up to 2 minutes before netperf terminates.
When the timeout for Netperf module happens the current approach is simply to send SIG_KILL to the process. It works fine but this also make all of command outputs to be lost so the user can't tell what happened to netperf besides that it was killed.
To overcome this limitation I've added graceful kill when timeout occurs. First the slave tries to send SIG_INT to the process and checks if the process ended for 5 seconds. If the process does not end it is SIG_KILLed.
I tried to test this as much as possible and I also ran the regression tests. Two of the tests had to be modified due to new reporting of graceful kill. Besides that all is working fine and even better.
Jan Tluka (7): NetTestCommand: add pid_exists method to NetTestCommand NetTestCommand: log interrupt of foreground and background command separately NetTestCommand: added graceful kill flag NetTestSlave: add graceful termination to kill_command Machine: use graceful kill_command on process timeout NetTestCommand: add missing join on interrupt regression-tests: update tests to match graceful termination on timeout
lnst/Common/NetTestCommand.py | 26 +++++++++++++++++++++++--- lnst/Controller/Machine.py | 9 +++++++-- lnst/Slave/NetTestSlave.py | 22 ++++++++++++++++++++-- regression-tests/tests/24/run.sh | 2 +- regression-tests/tests/27/run.sh | 4 ++-- 5 files changed, 53 insertions(+), 10 deletions(-)
Signed-off-by: Jan Tluka jtluka@redhat.com --- lnst/Common/NetTestCommand.py | 7 +++++-- 1 file changed, 5 insertions(+), 2 deletions(-)
diff --git a/lnst/Common/NetTestCommand.py b/lnst/Common/NetTestCommand.py index 774c26e..b64186d 100644 --- a/lnst/Common/NetTestCommand.py +++ b/lnst/Common/NetTestCommand.py @@ -188,15 +188,18 @@ class NetTestCommand: self._finished = True self._control_cmd = cmd
+ def pid_exists(self): + return os.path.exists("/proc/%d" % self._pid) + def interrupt(self, cmd): self._finished = True - if os.path.exists("/proc/%d" % self._pid): + if self.pid_exists(): logging.debug("Interrupting background command with id "%s", pid "%d"" % (self._id, self._pid)) os.killpg(os.getpgid(self._pid), signal.SIGINT) self._control_cmd = cmd
def kill(self, cmd): - if os.path.exists("/proc/%d" % self._pid): + if self.pid_exists(): if self._id: logging.debug("Killing background command with id "%s", pid "%d"" % (self._id, self._pid)) else:
Signed-off-by: Jan Tluka jtluka@redhat.com
# Conflicts: # lnst/Common/NetTestCommand.py
Signed-off-by: Jan Tluka jtluka@redhat.com --- lnst/Common/NetTestCommand.py | 5 ++++- 1 file changed, 4 insertions(+), 1 deletion(-)
diff --git a/lnst/Common/NetTestCommand.py b/lnst/Common/NetTestCommand.py index b64186d..3f11f92 100644 --- a/lnst/Common/NetTestCommand.py +++ b/lnst/Common/NetTestCommand.py @@ -194,7 +194,10 @@ class NetTestCommand: def interrupt(self, cmd): self._finished = True if self.pid_exists(): - logging.debug("Interrupting background command with id "%s", pid "%d"" % (self._id, self._pid)) + if self._id: + logging.debug("Interrupting background command with id "%s", pid "%d"" % (self._id, self._pid)) + else: + logging.debug("Interrupting command with id "%s", pid "%d"" % (self._id, self._pid)) os.killpg(os.getpgid(self._pid), signal.SIGINT) self._control_cmd = cmd
Fri, Apr 22, 2016 at 02:45:57PM CEST, jtluka@redhat.com wrote:
Signed-off-by: Jan Tluka jtluka@redhat.com
# Conflicts: # lnst/Common/NetTestCommand.py
OMG, is it Friday? I'm going to check this, I mean the patch not Friday.
Sorry for noise.
Fri, Apr 22, 2016 at 02:50:35PM CEST, jtluka@redhat.com wrote:
Fri, Apr 22, 2016 at 02:45:57PM CEST, jtluka@redhat.com wrote:
Signed-off-by: Jan Tluka jtluka@redhat.com
# Conflicts: # lnst/Common/NetTestCommand.py
OMG, is it Friday? I'm going to check this, I mean the patch not Friday.
Sorry for noise.
Ok, so it is a relict of conflict that has been resolved.
Anyway I'm sending v3.
-Jan
This flag indicates whether it was possible to kill the command gracefully. Since we're using forked processes the value must be shared between parent and child. I used multiprocessing.Value for this purpose.
The flag is used to mark the result to be sent to controller with the indication of this kind of interruption.
Signed-off-by: Jan Tluka jtluka@redhat.com --- lnst/Common/NetTestCommand.py | 13 +++++++++++++ 1 file changed, 13 insertions(+)
diff --git a/lnst/Common/NetTestCommand.py b/lnst/Common/NetTestCommand.py index 3f11f92..f09e402 100644 --- a/lnst/Common/NetTestCommand.py +++ b/lnst/Common/NetTestCommand.py @@ -100,6 +100,7 @@ class NetTestCommand: self._log_ctl = log_ctl self._start_time = None self._result_sent = False + self._gracefully_killed = multiprocessing.Value('i', 0)
if "bg_id" not in self._command: self._id = None @@ -177,9 +178,21 @@ class NetTestCommand: result["cmd_id"] = self._id result["result"] = res_data
+ if self.get_gracefully_killed() == 1: + result["result"]["gracefully_killed"] = True + send_data(self._write_pipe, result) self._write_pipe.close()
+ def set_gracefully_killed(self, value=True): + if value: + self._gracefully_killed.value = 1 + else: + self._gracefully_killed.value = 0 + + def get_gracefully_killed(self): + return self._gracefully_killed.value + def join(self): self._process.join()
I added graceful parameter to kill_command. When it is set to True the slave first tries to interrupt the command using SIG_INT. Then it waits 5 seconds and checks if the process is still alive using pid_exists(). When the command does not terminate after this period it is killed by SIG_KILL.
This is basically needed if we want to keep the output of a command that timeouts. When killed (not interrupted) the command's output is lost and never logged.
V2: place reset of the graceful flag in graceful branch
Signed-off-by: Jan Tluka jtluka@redhat.com --- lnst/Slave/NetTestSlave.py | 22 ++++++++++++++++++++-- 1 file changed, 20 insertions(+), 2 deletions(-)
diff --git a/lnst/Slave/NetTestSlave.py b/lnst/Slave/NetTestSlave.py index 3b8bf79..d8ad158 100644 --- a/lnst/Slave/NetTestSlave.py +++ b/lnst/Slave/NetTestSlave.py @@ -439,11 +439,29 @@ class SlaveMethods:
return res
- def kill_command(self, id): + def kill_command(self, id, graceful=False): cmd = self._command_context.get_cmd(id) if cmd is not None: if not cmd.get_result_sent(): - cmd.kill(None) + if graceful: + check=0 + cmd.set_gracefully_killed() + cmd.interrupt(None) + while check < 5: + sleep(1) + if not cmd.pid_exists(): + break + check += 1 + + if cmd.pid_exists(): + cmd.set_gracefully_killed(False) + logging.debug("Graceful termination timeout, killing") + cmd.kill(None) + else: + return None + else: + cmd.kill(None) + result = cmd.get_result() cmd.set_result_sent() return result
The controller now uses graceful killing of a command that timeouts. This is needed if we want to keep the output of the command.
Also we need to set fail flag on the result in that case. The reason is that for example IcmpPing module would report that it has passed based on processing data of interrupted ping command within the module.
Signed-off-by: Jan Tluka jtluka@redhat.com --- lnst/Controller/Machine.py | 9 +++++++-- 1 file changed, 7 insertions(+), 2 deletions(-)
diff --git a/lnst/Controller/Machine.py b/lnst/Controller/Machine.py index 5d8db7d..da63197 100644 --- a/lnst/Controller/Machine.py +++ b/lnst/Controller/Machine.py @@ -330,15 +330,20 @@ class Machine(object): except MachineError as exc: if "proc_id" in command: cmd_res = self._rpc_call_x(netns, "kill_command", - command["proc_id"]) + command["proc_id"], True) else: cmd_res = self._rpc_call_x(netns, "kill_command", - None) + None, True)
if "killed" in cmd_res and cmd_res["killed"]: cmd_res["passed"] = False cmd_res["msg"] = str(exc)
+ if "gracefully_killed" in cmd_res and \ + cmd_res["gracefully_killed"]: + cmd_res["passed"] = False + cmd_res["msg"] = "Command gracefully killed on timeout" + signal.alarm(0) signal.signal(signal.SIGALRM, prev_handler)
While working on graceful kill I noticed that when we call interrupt on a command then the forked child is left orphaned and the pid_exists() check always return true and graceful kill timeouts.
Adding join() call after sending the interrupt solves this problem.
Signed-off-by: Jan Tluka jtluka@redhat.com --- lnst/Common/NetTestCommand.py | 1 + 1 file changed, 1 insertion(+)
diff --git a/lnst/Common/NetTestCommand.py b/lnst/Common/NetTestCommand.py index f09e402..6573ffd 100644 --- a/lnst/Common/NetTestCommand.py +++ b/lnst/Common/NetTestCommand.py @@ -212,6 +212,7 @@ class NetTestCommand: else: logging.debug("Interrupting command with id "%s", pid "%d"" % (self._id, self._pid)) os.killpg(os.getpgid(self._pid), signal.SIGINT) + self._process.join() self._control_cmd = cmd
def kill(self, cmd):
Signed-off-by: Jan Tluka jtluka@redhat.com --- regression-tests/tests/24/run.sh | 2 +- regression-tests/tests/27/run.sh | 4 ++-- 2 files changed, 3 insertions(+), 3 deletions(-)
diff --git a/regression-tests/tests/24/run.sh b/regression-tests/tests/24/run.sh index 1aff4a3..0ec860d 100755 --- a/regression-tests/tests/24/run.sh +++ b/regression-tests/tests/24/run.sh @@ -14,7 +14,7 @@ log2=`cat test.log`
print_separator assert_status "fail" "$rv1" -assert_log "INFO" "RPC connection to machine \w* timed out" "$log1" +assert_log "INFO" "Command gracefully killed on timeout" "$log1" assert_status "pass" "$rv2"
rm -f test.log diff --git a/regression-tests/tests/27/run.sh b/regression-tests/tests/27/run.sh index e61da3f..e5acfeb 100755 --- a/regression-tests/tests/27/run.sh +++ b/regression-tests/tests/27/run.sh @@ -36,12 +36,12 @@ print_separator assert_status "pass" "$rv1" assert_status "pass" "$rv2" assert_status "fail" "$rv3" -assert_log "INFO" "RPC connection to machine testmachine1 timed out" "$log3" +assert_log "INFO" "Command gracefully killed on timeout" "$log3" assert_status "pass" "$rv4" assert_status "pass" "$rv5" assert_status "pass" "$rv6" assert_status "fail" "$rv7" -assert_log "INFO" "RPC connection to machine testmachine1 timed out" "$log7" +assert_log "INFO" "Command gracefully killed on timeout" "$log7"
rm -f test.log
lnst-developers@lists.fedorahosted.org