Gitweb:        http://git.fedorahosted.org/git/?p=cluster.git;a=commitdiff;h=6d60ac310fcb22...
Commit:        6d60ac310fcb224ca921bb78c8c43fd80ec1b03e
Parent:        e675576929f09ee208f22fdfdcb750dba92fe00f
Author:        David Teigland <teigland@redhat.com>
AuthorDate:    Mon Oct 14 11:42:26 2013 -0500
Committer:     David Teigland <teigland@redhat.com>
CommitterDate: Fri Dec 13 10:17:10 2013 -0600
gfs_controld: fix plock transfer during first mount recovery
The plock checkpoint is not unlinked properly during certain first mount recovery situations (lower nodeid mounts while higher nodeid is doing first mounter recovery). This leaves a stray checkpoint that prevents the following checkpoint from being created, which causes plock state to not be transferred to mounting nodes, which can lead to a plock being granted in multiple places at once.
node2: mount /gfs (it does first mount recovery)
node1: mount /gfs (while node2 is still doing first mount recovery)
node2: creates a plock checkpoint (empty) for node1, then closes the
       checkpoint because the new low nodeid is now in charge of it
node2: sends journal info to node1
node1: gets journal info from node2
       Takes the special code path because node2 is still doing first
       recovery.  Does not call retrieve_plocks on this code path
       because there are no plocks to retrieve in this case.  But the
       retrieve_plocks function is also responsible for unlinking the
       existing checkpoint on a new low nodeid, which this is.  So,
       node1 does not unlink the checkpoint as it should.
node2: finishes first mount recovery, completes mount
node1: notified that node2's first recovery is done, completes mount
node2: do plock /gfs/test (granted)
node1: killed
node1: restarts
node1: mount /gfs
node2: tries to create a checkpoint to transfer the plock state to
       node1, but this fails because the checkpoint exists, because
       node1 did not unlink it above.  So, plock state is not
       transferred to node1.
node1: do plock /gfs/test (granted)
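To make the failure mode concrete: in the AIS checkpoint API used by
gfs_controld, creating a checkpoint whose name is still linked fails
with SA_AIS_ERR_EXIST.  A minimal sketch of the collision (the
ckpt_handle, name and attr variables here are illustrative
placeholders, not code from the patch):

    SaCkptCheckpointHandleT h;
    SaAisErrorT rv;

    /* node2 tries to create the transfer checkpoint for node1 */
    rv = saCkptCheckpointOpen(ckpt_handle, &name, &attr,
                              SA_CKPT_CHECKPOINT_CREATE |
                              SA_CKPT_CHECKPOINT_WRITE,
                              0, &h);
    if (rv == SA_AIS_ERR_EXIST) {
            /* the stray checkpoint from the earlier recovery was
               never unlinked; nothing can be created under this name
               until saCkptCheckpointUnlink() is called on it */
    }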
The result is that both nodes have the same plock granted concurrently.
The solution is for node1 to call retrieve_plocks on the first mounter code path, as it does on the normal code path. retrieve_plocks will unlink the checkpoint in this case.
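The unlink side effect lives inside retrieve_plocks itself.  Roughly,
the shape it takes (a simplified sketch, not the actual function body;
the low_nodeid_is_us condition is illustrative):

    /* Sketch of the retrieve_plocks side effect the fix relies on. */
    void retrieve_plocks(struct mountgroup *mg)
    {
            SaNameT name;

            /* the real code derives the checkpoint name from the
               mountgroup before opening it */

            /* ... open the checkpoint and read plock sections; in
               the first mounter case there are none ... */

            /* if this node has become the new low nodeid (the new
               checkpoint master), unlink the old checkpoint so the
               next store_plocks can create a fresh one */
            if (low_nodeid_is_us(mg))       /* illustrative */
                    _unlink_checkpoint(mg, &name);
    }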
This patch also adds a lower-level backup method for creating plock checkpoints in case an unlink was missed somewhere. If store_plocks finds that the checkpoint already exists, it will try once to unlink it and recreate it.
Signed-off-by: David Teigland <teigland@redhat.com>
---
 group/gfs_controld/plock.c   | 11 +++++++----
 group/gfs_controld/recover.c |  4 ++++
 2 files changed, 11 insertions(+), 4 deletions(-)
diff --git a/group/gfs_controld/plock.c b/group/gfs_controld/plock.c
index d96604f..51e0882 100644
--- a/group/gfs_controld/plock.c
+++ b/group/gfs_controld/plock.c
@@ -2012,6 +2012,7 @@ void store_plocks(struct mountgroup *mg, int nodeid)
 	struct lock_waiter *w;
 	int r_count, lock_count, total_size, section_size, max_section_size;
 	int len, owner;
+	int retry_count = 0;
 
 	if (!plocks_online)
 		return;
@@ -2087,13 +2088,15 @@ void store_plocks(struct mountgroup *mg, int nodeid)
 	if (rv == SA_AIS_ERR_EXIST) {
 		log_group(mg, "store_plocks: ckpt already exists");
 		log_error("store_plocks: ckpt already exists");
-		/* TODO: best to unlink and retry? */
-		/*
+		/* We should in general be unlinking the ckpt in the
+		   proper places to avoid hitting this, but there are
+		   probably some cases where we miss the unlink, so
+		   this is a backup method. */
+		if (retry_count++)
+			return;
 		_unlink_checkpoint(mg, &name);
 		sleep(1);
 		goto open_retry;
-		*/
-		return;
 	}
 	if (rv != SA_AIS_OK) {
 		log_error("store_plocks: ckpt open error %d %s", rv, mg->name);
diff --git a/group/gfs_controld/recover.c b/group/gfs_controld/recover.c
index f70f798..87eee63 100644
--- a/group/gfs_controld/recover.c
+++ b/group/gfs_controld/recover.c
@@ -1018,6 +1018,10 @@ void received_our_jid(struct mountgroup *mg)
 		log_group(mg, "other node doing first mounter recovery, "
 			  "set mount_client_delay");
 		mg->mount_client_delay = 1;
+		/* There should be no plocks to retrieve since the fs is being
+		   mounted initially, but retrieve is needed to unlink an
+		   existing checkpoint if we are the new master. */
+		retrieve_plocks(mg);
 		mg->save_plocks = 0;
 		return;
 	}
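A note on the retry semantics in the plock.c hunk above: retry_count++
evaluates to 0 (false) on the first SA_AIS_ERR_EXIST, so store_plocks
unlinks the stray checkpoint, sleeps one second, and jumps back to
open_retry; if the open still reports SA_AIS_ERR_EXIST, the
post-increment is now nonzero and the function gives up rather than
looping forever.  A standalone illustration of the same retry-once
idiom (the hypothetical try_open stands in for the checkpoint open):

    #include <stdio.h>

    static int try_open(void)
    {
            return -1;      /* always fails, for demonstration */
    }

    int main(void)
    {
            int retry_count = 0;
    retry:
            if (try_open() < 0) {
                    if (retry_count++)      /* nonzero on second failure */
                            return 1;       /* give up after one retry */
                    /* recover (e.g. unlink the stale name), then retry */
                    goto retry;
            }
            printf("opened\n");
            return 0;
    }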