Hi,
the semantics of timeouts should be pretty clear - we have them to
ensure that the whole application doesn't hang when a command hangs
because of an unexpected problem.
I guess most of the confusion is directly related to the XML timeout
attribute when used for background commands, and what it can or should
mean. As you wrote, there are multiple scenarios here, but I'll divide
them into 3:
1. a background command is associated with a 'wait' command. This case is
simple and works correctly as explained.
2. the bg command is associated with a 'kill' command. This case is also
very simple, but can be looked at from two viewpoints: either we don't
care whether the bg cmd times out and just kill it, or we kill it when
it times out. Kill itself uses the default timeout, in case the
connection breaks. Currently we implement the first option and I
think this is correct; however, this is where the confusion for
users starts. What does the timeout attribute for the bg cmd mean if
it's not used by kill or the command itself? I think it would make
sense to simply not allow the user to specify a timeout attribute in
this case, since it's ignored.
3. the bg cmd is associated with an 'intr' command. This is similar to
the 'kill' case. The difference here is that interrupting the
command gives it time to process and return results. But what if the
command hangs during that? Should we use the timeout attribute of
the bg command or rely on the default timeout used by the intr
command? We currently implement the second option (see the sketch
after this list), but again here comes the confusion for users - what
does the timeout attribute for the bg command mean then?
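To make the 'kill'/'intr' cases concrete, here is a rough Python sketch
of what I mean - none of this is actual LNST code; DEFAULT_TIMEOUT and
the Popen-style proc object are just assumptions for illustration:

    import signal

    # Illustrative default, used in case the connection breaks.
    DEFAULT_TIMEOUT = 60

    def kill_bg(proc):
        # Current behaviour: we don't care whether the bg command timed
        # out, we just kill it. proc is a subprocess.Popen-like object.
        proc.kill()
        proc.wait(timeout=DEFAULT_TIMEOUT)

    def intr_bg(proc):
        # Interrupting gives the command time to process and return
        # results...
        proc.send_signal(signal.SIGINT)
        # ...but if it hangs here, only the default timeout applies; the
        # timeout attribute of the bg command is never consulted.
        proc.wait(timeout=DEFAULT_TIMEOUT)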
So the problem is this: the kill and intr commands are currently
implemented in such a way that the timeout attribute for the bg command
is ignored and therefore doesn't make sense, and it should probably be
removed or forbidden to avoid confusion. Unfortunately, this is not
possible, since while parsing the command we don't know whether the bg
command is associated with a wait, intr or kill command, and we still
want timeout attributes when wait is used.
One possible solution is to move the timeout attribute into the wait
command and remove it from bg commands. This is how it would look:
<run command="xyz" bg_id="1"/>
... other commands ...
<wait bg_id="1" timeout="100"/>
Here the timeout can mean one of two things:
* the maximum time I will wait for the command to return, measured from
the start of the wait command
* the maximum time I will wait for the command to return, measured from
the start of the background command (equivalent to what is currently
implemented)
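In code, the two interpretations differ only in the reference point. A
minimal sketch (wait_start/cmd_start are purely illustrative names, not
LNST internals):

    def deadline_from_wait(wait_start, timeout):
        # option 1: timeout counted from the start of the wait command
        return wait_start + timeout

    def deadline_from_bg(cmd_start, timeout):
        # option 2: timeout counted from the start of the bg command
        # (equivalent to what is currently implemented)
        return cmd_start + timeout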
And for intr/kill:
<run command="xyz" bg_id="1"/>
... other commands ...
<intr bg_id="1"/>
<run command="xyz" bg_id="2"/>
... other commands ...
<kill bg_id="2"/>
This would be implemented the same way as now - just the default timeout
applies, in case something breaks during the intr/kill execution.
Sorry if I repeated something that you already explained; I just tried
writing down my thought process in a way that makes sense.
-Ondrej
On Tue, Mar 10, 2015 at 04:26:05PM +0100, Jan Tluka wrote:
Hi everyone,
while discussing recent signal handling issues in LNST we also came
across timeout handling, which seems a bit unclear at the moment.
Any command or test has a timeout attribute that defines how much time
the command is allowed to take to finish. E.g.
<run command="sleep 9" timeout="10"/>
Internally we use SIGALRM to notify the controller that "time is up".
The simplest case is a command/test that's run in the foreground (i.e.
without a bg_id attribute). For such a command the SIGALRM handler is
set to raise an exception and SIGALRM is scheduled based on the timeout
attribute. If the command does not finish in time, it is killed.
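Roughly, the foreground path looks like this (a simplified sketch for
illustration; run_command and kill_command are hypothetical helpers, not
the actual LNST code):

    import signal

    class CommandTimeout(Exception):
        pass

    def _alarm_handler(signum, frame):
        raise CommandTimeout()

    def run_with_timeout(cmd, timeout):
        signal.signal(signal.SIGALRM, _alarm_handler)
        signal.alarm(timeout)        # schedule SIGALRM in `timeout` seconds
        try:
            return run_command(cmd)  # hypothetical: run cmd in foreground
        except CommandTimeout:
            kill_command(cmd)        # hypothetical: command ran out of time
        finally:
            signal.alarm(0)          # cancel any pending alarm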
Another case is a command that is put in the background. In that case
the scheduled SIGALRM is immediately reset. Therefore any timeout set
for a background command has no effect, since the SIGALRM is no longer
scheduled (it will actually be scheduled again for other commands in the
queue, but that is irrelevant here).
For background commands there are two scenarios:
a. intr()/kill() is called; here we don't actually care about the
command timeout, since the recipe is written so as to terminate the
background process itself
b. wait() is called; in this case SIGALRM is scheduled for the wait
command based on the remaining time of the command running in the
background, so the timeout is properly handled and valid (sketched
below)
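Case b. in a nutshell (a hypothetical sketch; bg_start_time and
bg_timeout stand in for whatever the controller tracks internally):

    import signal
    import time

    class CommandTimeout(Exception):
        pass

    def schedule_wait_alarm(bg_start_time, bg_timeout):
        # re-arm SIGALRM for the time the background command has left
        remaining = int(bg_start_time + bg_timeout - time.time())
        if remaining <= 0:
            raise CommandTimeout()  # budget exhausted before wait() started
        signal.alarm(remaining)     # fires if the command overruns its budget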
All of the above is a summary of sorts, and things are working reliably,
but I'd still like to start a discussion on the whole subject, including
the following.
If a user specifies a timeout for a background command and it is
accompanied by intr() or kill(), should we restrict that? Should we just
notify the user that the specified timeout does not make any difference?
Or should we implement it so that the timeout also works in this case?
Ondrej also mentioned that if the bg command runs for too long, it can
be killed because of a socket timeout, which might be unexpected
behavior.
Let me know what you think or if you have more ideas.
-Jan
_______________________________________________
LNST-developers mailing list
LNST-developers@lists.fedorahosted.org
https://lists.fedorahosted.org/mailman/listinfo/lnst-developers