Hello Adam Litke,
I'd like you to do a code review. Please visit
https://gerrit.ovirt.org/46485
to review the following change.
Change subject: safelease: Increase spmprotect timeouts ......................................................................
safelease: Increase spmprotect timeouts
When spmprotect.sh fail to renew the lease, it start a fencing process:
1. Send SIGUSR1 signal to vdsm 2. Send SIGTERM signal to vdsm after 7 seconds 3. Send SIGKILL signal to vdsm after 9 seconds 4. Reboot the machine after 20 seconds
When vdsm receives the SIGUSR1 signal, the signal handler invokes stopSpm, which releases the cluster lock. Releasing the cluster lock will terminate the waiting spmprotect.sh, preventing termination of vdsm and reboot.
If vdsm fails to release the cluster lock within 7 seconds, spmprotect.sh will terminate it, and if did not terminate, spmprotect.sh will kill it.
When systemd starts vdsm again, vdsm looks for spmprotect.sh processes and tries to release the lease. If the lease cannot be released after 10 seconds, it kills the pending spmprotect.sh processes, preventing reboot.
Testing with both block and file storage show that this flow is broken when access to master domain is blocked:
1. In block storage, vdsm gets stuck trying to unmount the master mount, and spmprotect.sh kills it before it try to release the cluster lock.
2. In file storage, vdsm gets stuck trying to write spm status to the master domain, and spmprotect.sh kills it before it try to release the cluster lock.
3. When vdsm starts up, sometimes it manage to kill the waiting spmprotect.sh process, and sometimes spmprotect.sh reboot the machine before vdsm kills it.
We cannot fix 1 and 2 easily. 3 can be fixed by giving vdsm more time for stopSpm flow, and more time to startup and kill pending spmprotect.sh process.
This patch increases spmprotect timeouts to increase the chance of clean shutdown and decrease the chance of unneeded reboot.
New spmprotect.sh flow is:
1. Send SIGUSR1 signal to vdsm 2. Send SIGTERM signal to vdsm after 10 seconds 3. Send SIGKILL signal to vdsm after 20 seconds 4. Reboot the machine after 60 seconds
Change-Id: Ib71fa06c21602fd9d43516c5b4c997c481708697 Bug-Url: https://bugzilla.redhat.com/1222564 Signed-off-by: Nir Soffer nsoffer@redhat.com Reviewed-on: https://gerrit.ovirt.org/46057 Continuous-Integration: Jenkins CI Reviewed-by: Adam Litke alitke@redhat.com --- M vdsm/storage/protect/spmprotect.sh 1 file changed, 3 insertions(+), 3 deletions(-)
git pull ssh://gerrit.ovirt.org:29418/vdsm refs/changes/85/46485/1
diff --git a/vdsm/storage/protect/spmprotect.sh b/vdsm/storage/protect/spmprotect.sh index fc0e390..b13b07c 100755 --- a/vdsm/storage/protect/spmprotect.sh +++ b/vdsm/storage/protect/spmprotect.sh @@ -71,12 +71,12 @@ trap "" INT
log "Fencing sdUUID=$sdUUID id=$ID lease_path=$LEASE_FILE" - (sleep 20 && echodo $REBOOTCMD) & + (sleep 60 && echodo $REBOOTCMD) & disown - (sleep 7 + (sleep 10 log "Trying to stop vdsm for sdUUID=$sdUUID id=$ID lease_path=$LEASE_FILE" echodo $KILL "$VDSM_PID" - sleep 2 + sleep 10 echodo $KILL -9 "$VDSM_PID" )& disown
automation@ovirt.org has posted comments on this change.
Change subject: safelease: Increase spmprotect timeouts ......................................................................
Patch Set 1: Verified-1
* Update tracker::#1222564::OK * Check Bug-Url::OK * Check Public Bug::#1222564::OK, public bug * Check Product::#1222564::OK, Correct product Red Hat Enterprise Virtualization Manager * Check TR::#1222564::ERROR, wrong target release for stable branch, 3.6.0 should match ^3.[54321].* * Check merged to previous::WARN, Still open on branches ovirt-3.6
Nir Soffer has posted comments on this change.
Change subject: safelease: Increase spmprotect timeouts ......................................................................
Patch Set 1:
Needs 3.5 bug
Allon Mureinik has posted comments on this change.
Change subject: safelease: Increase spmprotect timeouts ......................................................................
Patch Set 1:
(1 comment)
https://gerrit.ovirt.org/#/c/46485/1//COMMIT_MSG Commit Message:
Line 53: 3. Send SIGKILL signal to vdsm after 20 seconds Line 54: 4. Reboot the machine after 60 seconds Line 55: Line 56: Change-Id: Ib71fa06c21602fd9d43516c5b4c997c481708697 Line 57: Bug-Url: https://bugzilla.redhat.com/1222564 https://bugzilla.redhat.com/show_bug.cgi?id=1265177 Line 58: Signed-off-by: Nir Soffer nsoffer@redhat.com Line 59: Reviewed-on: https://gerrit.ovirt.org/46057 Line 60: Continuous-Integration: Jenkins CI
Allon Mureinik has posted comments on this change.
Change subject: safelease: Increase spmprotect timeouts ......................................................................
Patch Set 1:
+1 on the code, please fix the bug url
automation@ovirt.org has posted comments on this change.
Change subject: safelease: Increase spmprotect timeouts ......................................................................
Patch Set 2:
* Update tracker::#1265177::OK * Check Bug-Url::OK * Check Public Bug::#1265177::OK, public bug * Check Product::#1265177::OK, Correct product Red Hat Enterprise Virtualization Manager * Check TR::#1265177::OK, correct target release 3.5.6 * Check merged to previous::OK, change not open on any previous branch
Allon Mureinik has posted comments on this change.
Change subject: safelease: Increase spmprotect timeouts ......................................................................
Patch Set 2: Code-Review+1
Francesco Romani has posted comments on this change.
Change subject: safelease: Increase spmprotect timeouts ......................................................................
Patch Set 2: Continuous-Integration+1
run "make check" locally
Yaniv Bronhaim has posted comments on this change.
Change subject: safelease: Increase spmprotect timeouts ......................................................................
Patch Set 2: Code-Review+2
Nir Soffer has posted comments on this change.
Change subject: safelease: Increase spmprotect timeouts ......................................................................
Patch Set 2: Verified+1
Tested together with https://gerrit.ovirt.org/46484 as described in https://bugzilla.redhat.com/show_bug.cgi?id=1222564#c11
Allon Mureinik has posted comments on this change.
Change subject: safelease: Increase spmprotect timeouts ......................................................................
Patch Set 2:
Can we move forward with this please?
Yaniv Bronhaim has submitted this change and it was merged.
Change subject: safelease: Increase spmprotect timeouts ......................................................................
safelease: Increase spmprotect timeouts
When spmprotect.sh fail to renew the lease, it start a fencing process:
1. Send SIGUSR1 signal to vdsm 2. Send SIGTERM signal to vdsm after 7 seconds 3. Send SIGKILL signal to vdsm after 9 seconds 4. Reboot the machine after 20 seconds
When vdsm receives the SIGUSR1 signal, the signal handler invokes stopSpm, which releases the cluster lock. Releasing the cluster lock will terminate the waiting spmprotect.sh, preventing termination of vdsm and reboot.
If vdsm fails to release the cluster lock within 7 seconds, spmprotect.sh will terminate it, and if did not terminate, spmprotect.sh will kill it.
When systemd starts vdsm again, vdsm looks for spmprotect.sh processes and tries to release the lease. If the lease cannot be released after 10 seconds, it kills the pending spmprotect.sh processes, preventing reboot.
Testing with both block and file storage show that this flow is broken when access to master domain is blocked:
1. In block storage, vdsm gets stuck trying to unmount the master mount, and spmprotect.sh kills it before it try to release the cluster lock.
2. In file storage, vdsm gets stuck trying to write spm status to the master domain, and spmprotect.sh kills it before it try to release the cluster lock.
3. When vdsm starts up, sometimes it manage to kill the waiting spmprotect.sh process, and sometimes spmprotect.sh reboot the machine before vdsm kills it.
We cannot fix 1 and 2 easily. 3 can be fixed by giving vdsm more time for stopSpm flow, and more time to startup and kill pending spmprotect.sh process.
This patch increases spmprotect timeouts to increase the chance of clean shutdown and decrease the chance of unneeded reboot.
New spmprotect.sh flow is:
1. Send SIGUSR1 signal to vdsm 2. Send SIGTERM signal to vdsm after 10 seconds 3. Send SIGKILL signal to vdsm after 20 seconds 4. Reboot the machine after 60 seconds
Change-Id: Ib71fa06c21602fd9d43516c5b4c997c481708697 Bug-Url: https://bugzilla.redhat.com/1265177 Signed-off-by: Nir Soffer nsoffer@redhat.com Reviewed-on: https://gerrit.ovirt.org/46057 Continuous-Integration: Jenkins CI Reviewed-by: Adam Litke alitke@redhat.com Reviewed-on: https://gerrit.ovirt.org/46485 Reviewed-by: Allon Mureinik amureini@redhat.com Continuous-Integration: Francesco Romani fromani@redhat.com Reviewed-by: Yaniv Bronhaim ybronhei@redhat.com --- M vdsm/storage/protect/spmprotect.sh 1 file changed, 3 insertions(+), 3 deletions(-)
Approvals: Nir Soffer: Verified Yaniv Bronhaim: Looks good to me, approved Allon Mureinik: Looks good to me, but someone else must approve Francesco Romani: Passed CI tests
automation@ovirt.org has posted comments on this change.
Change subject: safelease: Increase spmprotect timeouts ......................................................................
Patch Set 3:
* Update tracker::#1265177::OK * Set MODIFIED::bug 1265177::::#1265177::::IGNORE, not oVirt prod but Red Hat Enterprise Virtualization Manager
vdsm-patches@lists.fedorahosted.org