Changes to 'test'

Tuesday, 7 August 2012


New branch 'test' available with the following commits:
commit 2eab3d08b7a5aee49b0fad2234c8ddbd14c47d0e
Author: David Teigland <teigland(a)redhat.com>
Date:   Mon Aug 6 17:18:56 2012 -0500

    wdmd: preemptive close before test fails
    
    Instead of closing the device when a test fails, close
    it TEST_INTERVAL (10 sec) before the test fails.  This
    is done so that the watchdog will fire at most 60 sec
    after the expire time (between 50 and 60 seconds instead
    of between 60 and 70 seconds which would be the case
    if we close at the expiration time; see previous commit).
    
    The timeouts in sanlock have been based on the assumption
    that the watchdog device fires at most 60 seconds after
    the expiration time, so it's best to maintain that
    expectation.
    
    The pre-emptive close and re-open generate pings, so
    they are used in place of ordinary pings.
    
    If the expire time is at T45, and is renewed/extended
    at T46, then the sequence of pings would be:
    
    T10 - ping from ioctl
    T20 - ping from ioctl
    T30 - ping from ioctl
    T40 - ping from close
    T50 - ping from re-open
    T60 - ping from ioctl
    ...
    
    If the expire time was *not* renewed, then the watchdog
    would fire at T100, which is 55 seconds after the
    expiration time.  55 is less than the 60 second limit
    we want.
    
    Signed-off-by: David Teigland <teigland(a)redhat.com>
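
The ping sequence described in the commit message above can be reproduced with a small simulation. This is only an illustrative sketch, not wdmd code: the function name `ping_sequence` and its parameters are hypothetical, tests are assumed to run at exact multiples of TEST_INTERVAL, the 60 second fire timeout is taken from the message, and a renewal is simplified to mean "no further expiry within the simulated window".

```python
TEST_INTERVAL = 10  # seconds between tests (value from the commit message)
FIRE_TIMEOUT = 60   # seconds from the last ping to the watchdog firing

def ping_sequence(expire_time, renewed_at=None, until=60):
    """Simulate the pings under the preemptive-close scheme.

    Returns (pings, fire_time). pings is a list of (time, source) tuples;
    fire_time is None if the device is being petted normally at the end.
    Hypothetical model: a renewal removes the pending expiry entirely.
    """
    pings = []
    closed = False
    t = TEST_INTERVAL
    while t <= until:
        renewed = renewed_at is not None and renewed_at <= t
        if closed:
            if not renewed:
                break  # device stays closed, so no further pings happen
            pings.append((t, "re-open"))
            closed = False
        elif not renewed and expire_time <= t + TEST_INTERVAL:
            # The expiry falls before the next test: close one interval early.
            pings.append((t, "close"))
            closed = True
        else:
            pings.append((t, "ioctl"))
        t += TEST_INTERVAL
    fire_time = pings[-1][0] + FIRE_TIMEOUT if closed else None
    return pings, fire_time

if __name__ == "__main__":
    print(ping_sequence(45, renewed_at=46))  # renewed: close at T40, re-open at T50
    print(ping_sequence(45))                 # not renewed: last ping is the close at T40
```

Under these assumptions, the unrenewed case ends with the close at T40 as the last ping, so the watchdog fires at T100, matching the 55 second figure in the message.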

commit 15ca80d82e619de84a3b365bd6400380a51bc0a3
Author: David Teigland <teigland(a)redhat.com>
Date:   Mon Aug 6 15:38:39 2012 -0500

    wdmd: close device when test fails
    
    Instead of just not petting the device after a test fails,
    close the device.  Because the close generates a ping, we
    want to get it done early, otherwise if wdmd exited (e.g.
    crash or sigkill) just before the device was ready to fire,
    the close generated by the kernel extends the life of the
    machine by an extra 60 sec.  This means we need to re-open
    the device if we want to resume petting it.
    
    So, depending on whether the tests happen just prior
    to the expiry or just after the expiry, the watchdog
    will fire between 60 and 70 seconds after the expiry
    time.
    
    It would be 70 seconds if:
    
    we do the check just before the expiration, the client
    expires, 10 seconds (TEST_INTERVAL) later, we see the
    expiration, close the device, which generates a ping,
    which causes the firing to be 60 seconds after the close,
    which is already 10 seconds after the expiration.
    
    It would be 60 seconds if:
    
    we do the check just after the expiration, we see
    the expiration, close the device, which generates a
    ping, which causes the firing to be 60 seconds after
    the close, which is just after the expiration
    time.
    
    Previously, the assumption was that the host would
    be reset between 50 and 60 seconds from the expiration
    time, but this did not account for the fact that
    the daemon could exit just before the host reset,
    which would lead the kernel to generate a new ping.
    
    If we can patch the kernel so that a device close
    does not generate a ping, then we do not need to
    close the device when a test fails, but we can
    simply not pet the device, as we've been doing.
    
    Signed-off-by: David Teigland <teigland(a)redhat.com>
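
The 60 to 70 second window in the commit message above, and the earlier 50 to 60 second assumption, follow from simple arithmetic on when the first check after the expiry runs. The sketch below is an illustration only; the function names and the `check_lag` parameter (seconds between the expiry and the first check that observes it) are hypothetical, and the 10 and 60 second constants come from the message.

```python
TEST_INTERVAL = 10  # seconds between checks (value from the commit message)
FIRE_TIMEOUT = 60   # seconds from the last ping to the watchdog firing

def _validate(check_lag):
    if not 0 <= check_lag <= TEST_INTERVAL:
        raise ValueError("check_lag must be between 0 and TEST_INTERVAL")

def delay_not_petting(check_lag):
    """Expiry-to-reset delay when the daemon just stops petting.

    The last ping is the check before the expiry, which ran
    TEST_INTERVAL - check_lag seconds before the expiry.
    """
    _validate(check_lag)
    return FIRE_TIMEOUT - (TEST_INTERVAL - check_lag)

def delay_close_on_failure(check_lag):
    """Expiry-to-reset delay when the failing check closes the device.

    The close itself generates a ping check_lag seconds after the expiry.
    """
    _validate(check_lag)
    return check_lag + FIRE_TIMEOUT

if __name__ == "__main__":
    for lag in (0, 5, 10):
        print(lag, delay_not_petting(lag), delay_close_on_failure(lag))
```

The first function does not model the case the message warns about, where the daemon exits just before the reset and the kernel's close adds another ping.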

commit c92595469f65cffd8807c32abcb2e1af1733e462
Author: David Teigland <teigland(a)redhat.com>
Date:   Tue Jul 31 13:41:28 2012 -0500

    daemon: extend grace time
    
    Increase the default grace time for a killpath instance
    from 30 to 40 seconds based on a corrected analysis of
    the recovery sequence.  The period during which the
    watchdog may fire is determined by the wdmd check
    interval (10 seconds), not the sanlock renewal interval.
    
    Signed-off-by: David Teigland <teigland(a)redhat.com>

commit e7eb4d0fd34fc643213452cb76f5c64fb86063bb
Author: David Teigland <teigland(a)redhat.com>
Date:   Tue Aug 7 15:57:22 2012 -0500

    sanlock/wdmd: remove global connection
    
    As long as the sanlock daemon was running, it kept a
    constant connection to wdmd, even when no lockspaces
    existed.  This prevented sanlock and wdmd from being
    restarted independently, even when they were unused.
    Independent restarting is necessary for upgrades, so
    remove the global connection from sanlock to wdmd and
    leave only the per-lockspace connections.  The lockspace
    connections now need to hold a refcount on wdmd which
    prevents wdmd restarts.
    
    Also in wdmd, if a client connection is closed, the
    refcount must not be cleared on it, otherwise wdmd
    could possibly be cleanly shut down from sigterm while
    an expired connection was awaiting a watchdog reset.
    
    Also log at the error level when we kill a pid for
    recovery, when that pid exits, and when all pids
    are clear for recovery.  This makes it much simpler
    to see exactly what led up to a watchdog reset
    after the fact.
    
    Signed-off-by: David Teigland <teigland(a)redhat.com>
