Gitweb:        http://git.fedorahosted.org/git/?p=cluster.git;a=commitdiff;h=c61e71766564cd...
Commit:        c61e71766564cdeaea3c46cfce061000a0aa3879
Parent:        dca24557dceec608a0e5a36efcc1005fd03f4549
Author:        David Teigland <teigland@redhat.com>
AuthorDate:    Thu Oct 10 11:58:58 2013 -0500
Committer:     Christine Caulfield <ccaulfie@redhat.com>
CommitterDate: Tue Dec 17 10:37:50 2013 +0000
gfs_controld: fix plock recovery
When there are two nodes in the cluster and the node in charge of the plock checkpoint fails, the remaining node does not unlink the checkpoint that had been created by the failed node. When the failed node returns and the remaining node attempts to transfer plock state, it fails to create a new checkpoint because it never unlinked the previous checkpoint created by the failed node. As a result, existing plock state is not transferred to the newly joined node. The newly joined node will then mistakenly grant plocks to itself that may conflict with the plocks the other node could not transfer. This leads to:
1. conflicting plocks being held concurrently
2. dangling plocks that are not held but not removed
In the explanation above, the reason the remaining node does not unlink the checkpoint created by the other node is that it does not know the other node was in charge of the checkpoint. It could only know this if it had been present before and after the previous membership change; because there are only two nodes, that was not possible. This, however, is also the point exploited to fix the problem: when there are only two members, a new node can assume that the other node is in charge of the checkpoint.
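The two-node assumption can be sketched as a standalone function. This is a simplified stand-in for the logic added to update_master_nodeid(): struct member, pick_master() and its parameters are illustrative names, not the real mountgroup structures.

```c
/* Hypothetical, simplified model of the master-selection rule.
 * "finished" stands in for a member that was present across the
 * previous membership change and so could be the checkpoint master. */
struct member {
	int nodeid;
	int finished;
};

/* Return the checkpoint master nodeid, or -1 if unknown. */
int pick_master(const struct member *membs, int count, int our_nodeid)
{
	int new = -1, other = -1;
	int i;

	for (i = 0; i < count; i++) {
		/* remember the nodeid of any member that is not us */
		if (membs[i].nodeid != our_nodeid)
			other = membs[i].nodeid;

		/* lowest finished nodeid is the master */
		if (membs[i].finished &&
		    (new == -1 || membs[i].nodeid < new))
			new = membs[i].nodeid;
	}

	/* the two-node special case from the fix: if we cannot tell
	 * who owned the checkpoint, assume the other node did */
	if (new == -1 && count == 2)
		return other;
	return new;
}
```

With more than two members, an unknown master stays unknown; only the exact two-member case justifies the assumption, which is why the fix checks total == 2.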
The following test shows the problem/fix using a program "doplock" that requests an exclusive, blocking POSIX lock on the given file.
node1: mount /gfs
node2: mount /gfs
node1: touch /gfs/test
node1: doplock /gfs/test (granted)
node2: doplock /gfs/test (blocks)
node1: killed
node2: recovery for node1
node2: doplock above granted the lock
node1: restarts
node1: mount /gfs
node1: doplock /gfs/test
In the last step, the node1 doplock should block because node2 holds the lock. Before the fix, it was granted.
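For reference, the exclusive, blocking request that doplock makes can be sketched with fcntl(2). The doplock program itself is not part of this commit, and take_exclusive_lock() is a hypothetical name used for illustration.

```c
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

/* Request an exclusive (write) POSIX lock on the whole file,
 * blocking until it is granted -- the behavior the doplock test
 * program is described as having. */
int take_exclusive_lock(int fd)
{
	struct flock fl;

	memset(&fl, 0, sizeof(fl));
	fl.l_type = F_WRLCK;	/* exclusive lock */
	fl.l_whence = SEEK_SET;
	fl.l_start = 0;
	fl.l_len = 0;		/* 0 = lock the whole file */

	/* F_SETLKW waits until the lock can be granted */
	return fcntl(fd, F_SETLKW, &fl);
}
```

On a GFS mount these requests go through gfs_controld's plock machinery, which is why losing checkpoint state during recovery can let both nodes believe they hold the same lock.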
Signed-off-by: David Teigland <teigland@redhat.com>
---
 group/gfs_controld/plock.c   |    7 +++++++
 group/gfs_controld/recover.c |   14 ++++++++++++--
 2 files changed, 19 insertions(+), 2 deletions(-)
diff --git a/group/gfs_controld/plock.c b/group/gfs_controld/plock.c
index 4330a2c..d96604f 100644
--- a/group/gfs_controld/plock.c
+++ b/group/gfs_controld/plock.c
@@ -2086,6 +2086,13 @@ void store_plocks(struct mountgroup *mg, int nodeid)
 	}
 	if (rv == SA_AIS_ERR_EXIST) {
 		log_group(mg, "store_plocks: ckpt already exists");
+		log_error("store_plocks: ckpt already exists");
+		/* TODO: best to unlink and retry? */
+		/*
+		_unlink_checkpoint(mg, &name);
+		sleep(1);
+		goto open_retry;
+		*/
 		return;
 	}
 	if (rv != SA_AIS_OK) {
diff --git a/group/gfs_controld/recover.c b/group/gfs_controld/recover.c
index b33b3fd..f70f798 100644
--- a/group/gfs_controld/recover.c
+++ b/group/gfs_controld/recover.c
@@ -1257,8 +1257,15 @@ void update_master_nodeid(struct mountgroup *mg)
 {
 	struct mg_member *memb;
 	int new = -1, low = -1;
+	int other_nodeid = -1;
+	int total = 0;
 
 	list_for_each_entry(memb, &mg->members, list) {
+		total++;
+
+		if (memb->nodeid != our_nodeid)
+			other_nodeid = memb->nodeid;
+
 		if (low == -1 || memb->nodeid < low)
 			low = memb->nodeid;
 		if (!memb->finished)
@@ -1268,6 +1275,9 @@ void update_master_nodeid(struct mountgroup *mg)
 	}
 	mg->master_nodeid = new;
 	mg->low_nodeid = low;
+
+	if (new == -1 && total == 2)
+		mg->master_nodeid = other_nodeid;
 }
 
 /* This can happen before we receive a journals message for our mount. */
@@ -1354,8 +1364,8 @@ void recover_members(struct mountgroup *mg, int num_nodes,
 	*pos_out = pos;
 	*neg_out = neg;
 
-	log_group(mg, "total members %d master_nodeid %d prev %d",
-		  mg->memb_count, mg->master_nodeid, prev_master_nodeid);
+	log_group(mg, "total members %d master_nodeid %d prev %d failed %d",
+		  mg->memb_count, mg->master_nodeid, prev_master_nodeid, master_failed);
 
 	/* The master failed and we're the new master, we need to:
cluster-commits@lists.fedorahosted.org