New branch 'test' available with the following commits:
commit 2eab3d08b7a5aee49b0fad2234c8ddbd14c47d0e
Author: David Teigland <teigland(a)redhat.com>
Date: Mon Aug 6 17:18:56 2012 -0500
wdmd: preemptive close before test fails
Instead of closing the device when a test fails, close
it TEST_INTERVAL (10 sec) before the test fails. This
is done so that the watchdog will fire at most 60 sec
after the expire time (between 50 and 60 seconds instead
of between 60 and 70 seconds which would be the case
if we close at the expiration time; see previous commit).
The timeouts in sanlock have been based on the assumption
that the watchdog device fires at most 60 seconds after
the expiration time, so it's best to maintain that
expectation.
The preemptive close and re-open generate pings, so
they are used in place of ordinary pings.
If the expire time is at T45, and is renewed/extended
at T46, then the sequence of pings would be:
T10 - ping from ioctl
T20 - ping from ioctl
T30 - ping from ioctl
T40 - ping from close
T50 - ping from re-open
T60 - ping from ioctl
...
If the expire time was *not* renewed, then the watchdog
would fire at T100, which is 55 seconds after the
expiration time. 55 is less than the 60 second limit
we want.
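
The ping sequence above can be reproduced with a small simulation. This is
only a sketch, not wdmd code: it assumes tests run at multiples of
TEST_INTERVAL, that the watchdog fires FIRE_TIMEOUT (60 sec) after the last
ping, and that a test fails when it runs at or after the expire time. The
function names and the renewed expire time of T200 are hypothetical, chosen
for illustration.

```python
TEST_INTERVAL = 10  # seconds between wdmd tests (from the commit message)
FIRE_TIMEOUT = 60   # seconds from the last ping to the watchdog firing

def expire_known_at(now, initial_expire, renewals):
    """Expire time as seen at `now`; renewals is a list of (when, new_expire)."""
    expire = initial_expire
    for when, new_expire in sorted(renewals):
        if when <= now:
            expire = new_expire
    return expire

def ping_sequence(initial_expire, renewals, end_time):
    """Return [(time, source)] for every ping generated up to end_time."""
    device_open = True
    pings = []
    for now in range(TEST_INTERVAL, end_time + 1, TEST_INTERVAL):
        expire = expire_known_at(now, initial_expire, renewals)
        # Assumption: the next test fails if it runs at or after the expire time.
        next_test_fails = now + TEST_INTERVAL >= expire
        if next_test_fails:
            if device_open:
                # Preemptive close, one TEST_INTERVAL before the failing test.
                device_open = False
                pings.append((now, "close"))
            # Already closed: no ping is generated.
        elif not device_open:
            device_open = True
            pings.append((now, "re-open"))
        else:
            pings.append((now, "ioctl"))
    return pings

def fire_time(pings):
    """Watchdog fire time if no ping follows the last one in `pings`."""
    return pings[-1][0] + FIRE_TIMEOUT

# Expire time T45, renewed at T46 (new expire time T200 is hypothetical).
renewed = ping_sequence(45, [(46, 200)], 60)
# Expire time T45, never renewed.
not_renewed = ping_sequence(45, [], 100)

print(renewed)
print(fire_time(not_renewed) - 45)  # seconds between expiry and firing
```

In the renewed case the list matches the T10 to T60 sequence above; in the
non-renewed case the last ping is the close at T40, so the delay after the
expire time is 55 seconds.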
Signed-off-by: David Teigland <teigland(a)redhat.com>
commit 15ca80d82e619de84a3b365bd6400380a51bc0a3
Author: David Teigland <teigland(a)redhat.com>
Date: Mon Aug 6 15:38:39 2012 -0500
wdmd: close device when test fails
Instead of just not petting the device after a test fails,
close the device. Because the close generates a ping, we
want to get it done early, otherwise if wdmd exited (e.g.
crash or sigkill) just before the device was ready to fire,
the close generated by the kernel extends the life of the
machine by an extra 60 sec. This means we need to re-open
the device if we want to resume petting it.
So, depending on whether the tests happen just prior
to the expiry or just after the expiry, the watchdog
will fire between 60 and 70 seconds after the expiry
time.
It would be 70 seconds if:
we do the check just before the expiration, the client
expires, 10 seconds (TEST_INTERVAL) later, we see the
expiration, close the device, which generates a ping,
which causes the firing to be 60 seconds after the close,
which is already 10 seconds after the expiration.
It would be 60 seconds if:
we do the check just after the expiration, we see
the expiration, close the device, which generates a
ping, which causes the firing to be 60 seconds after
the close, which is just after the expiration
time.
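
The 60 to 70 second window can be checked with a little arithmetic. The
sketch below is illustrative only: it assumes tests run at multiples of
TEST_INTERVAL, that a test sees the expiration when it runs at or after the
expire time, and that the close happens at that test. The function name is
hypothetical.

```python
TEST_INTERVAL = 10  # seconds between wdmd tests (from the commit message)
FIRE_TIMEOUT = 60   # seconds from the close-generated ping to firing

def fire_delay_close_on_failure(expire_time):
    """Seconds between the expire time and the watchdog firing."""
    # Assumption: the first test at or after expire_time sees the expiration.
    first_failing_test = -(-expire_time // TEST_INTERVAL) * TEST_INTERVAL
    # The close at that test generates a ping, restarting the 60 sec countdown.
    return first_failing_test + FIRE_TIMEOUT - expire_time

# Expire times from T41 (just after the T40 test) to T50 (at the T50 test).
delays = [fire_delay_close_on_failure(t) for t in range(41, 51)]
print(min(delays), max(delays))
```

With whole-second expire times the delay ranges from 60 (expiry at the test)
to 69 (expiry one second after the previous test), approaching 70 as the
expiry moves closer to the previous test.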
Previously, the assumption was that the host would
be reset between 50 and 60 seconds from the expiration
time, but this did not account for the fact that
the daemon could exit just before the host reset,
which would lead the kernel to generate a new ping.
If we can patch the kernel so that a device close
does not generate a ping, then we do not need to
close the device when a test fails, but we can
simply not pet the device, as we've been doing.
Signed-off-by: David Teigland <teigland(a)redhat.com>
commit c92595469f65cffd8807c32abcb2e1af1733e462
Author: David Teigland <teigland(a)redhat.com>
Date: Tue Jul 31 13:41:28 2012 -0500
daemon: extend grace time
Increase the default grace time for a killpath instance
from 30 to 40 seconds based on a corrected analysis of
the recovery sequence. The period during which the
watchdog may fire is determined by the wdmd check
interval (10 seconds), not the sanlock renewal interval.
Signed-off-by: David Teigland <teigland(a)redhat.com>
commit e7eb4d0fd34fc643213452cb76f5c64fb86063bb
Author: David Teigland <teigland(a)redhat.com>
Date: Tue Aug 7 15:57:22 2012 -0500
sanlock/wdmd: remove global connection
As long as the sanlock daemon was running, it kept a
constant connection to wdmd, even when no lockspaces
existed. This prevented sanlock and wdmd from being
restarted independently, even when they were unused.
Independent restarting is necessary for upgrades, so
remove the global connection from sanlock to wdmd and
leave only the per-lockspace connections. The lockspace
connections now need to hold a refcount on wdmd which
prevents wdmd restarts.
Also in wdmd, if a client connection is closed, the
refcount must not be cleared on it, otherwise wdmd
could possibly be cleanly shut down from sigterm while
an expired connection was awaiting a watchdog reset.
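
The refcount rule can be pictured with a toy model. This is a sketch under
assumptions, not the wdmd implementation: the class, its methods, and the
client name "lockspace1" are all hypothetical, and the only behavior taken
from the commit message is that closing a connection leaves its refcount in
place and that a held refcount blocks a clean shutdown.

```python
class WdmdSketch:
    """Toy model of refcounts gating a clean shutdown (hypothetical API)."""

    def __init__(self):
        self.refcounts = {}  # client id -> True if the client holds a refcount

    def connect(self, client_id):
        self.refcounts[client_id] = False

    def set_refcount(self, client_id):
        self.refcounts[client_id] = True

    def clear_refcount(self, client_id):
        # Only an explicit request from the client clears the refcount.
        self.refcounts[client_id] = False

    def connection_closed(self, client_id):
        # Deliberately leave the refcount untouched: a closed connection
        # may belong to an expired client awaiting a watchdog reset.
        pass

    def can_shutdown(self):
        # A clean shutdown (e.g. on sigterm) is allowed only with no refcounts.
        return not any(self.refcounts.values())


wdmd = WdmdSketch()
wdmd.connect("lockspace1")
wdmd.set_refcount("lockspace1")
wdmd.connection_closed("lockspace1")
blocked = not wdmd.can_shutdown()   # refcount survives the close

wdmd.clear_refcount("lockspace1")
allowed = wdmd.can_shutdown()       # explicit clear permits shutdown
print(blocked, allowed)
```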
Also log at the error level when we kill a pid for
recovery, when that pid exits, and when all pids
are clear for recovery. This makes it much simpler
to see exactly what led up to a watchdog reset
after the fact.
Signed-off-by: David Teigland <teigland(a)redhat.com>