cluster: RHEL6 - fenced: fix handling of startup partition merge
by David Teigland
Gitweb: http://git.fedorahosted.org/git/cluster.git?p=cluster.git;a=commitdiff;h=...
Commit: 5457043e975ba4c44c17138b3751150b974aa35c
Parent: af612b16a8b565a0e3543850367e0b58a43922cd
Author: David Teigland <teigland(a)redhat.com>
AuthorDate: Mon Oct 31 12:06:41 2011 -0500
Committer: David Teigland <teigland(a)redhat.com>
CommitterDate: Thu Mar 1 16:07:29 2012 -0600
fenced: fix handling of startup partition merge
The victims created on each side of a partition are cleared after
a merge in the receive_complete function, which is meant to clear
"initial victims". Clearing the victims is the correct end result,
but the code arrives there through an unintended shortcut. Change
the code so that it clears the victims in a more deliberate, and
probably safer, way:
2 is blocked doing startup fencing
1 joins fence domain
partition between 1,2
1 sees confchg for partition, adds victim 2 (and sets init_victim
  per this patch, since 1 hasn't finished a start cycle yet)
partition removed
2 completes startup fencing
2 sees confchg for partition, adds victim 1
1 sees confchg for merge, adds node 2
2 sees confchg for merge, adds node 1
1 processing the merge confchg
2 reduces victim 1 from partition, since 1 had no state (had not yet
  completed a start cycle)
2 processing the merge confchg
1,2 finish start cycle for merge confchg
2 sends complete for merge confchg
1 clears victim 2 in receive_complete because it set init_victim
bz 750314
Signed-off-by: David Teigland <teigland(a)redhat.com>
---
fence/fenced/cpg.c | 36 ++++++++++++++++++++++++++++++++++--
1 files changed, 34 insertions(+), 2 deletions(-)
diff --git a/fence/fenced/cpg.c b/fence/fenced/cpg.c
index 99e16a0..6634f8c 100644
--- a/fence/fenced/cpg.c
+++ b/fence/fenced/cpg.c
@@ -1132,8 +1132,11 @@ static void receive_complete(struct fd *fd, struct fd_header *hd, int len)
list_for_each_entry_safe(node, safe, &fd->victims, list) {
log_debug("receive_complete clear victim nodeid %d init %d",
node->nodeid, node->init_victim);
- list_del(&node->list);
- free(node);
+
+ if (node->init_victim) {
+ list_del(&node->list);
+ free(node);
+ }
}
}
@@ -1319,6 +1322,35 @@ static void add_victims(struct fd *fd, struct change *cg)
return;
list_add(&node->list, &fd->victims);
log_debug("add_victims node %d", node->nodeid);
+
+ /*
+ * If we haven't completed a start cycle yet, set
+ * init_victim on any failed node so that receive_complete
+ * will clear it. This is a hack for one specific scenario:
+ *
+ * - node 2 joins domain, blocks in startup fencing
+ * - node 1 joins domain, waiting for messages in start cycle
+ * - partition between 1,2
+ * - 1 adds victim 2
+ * (and sets init_victim below since 1 hasn't completed
+ * a start cycle yet)
+ * - partition removed
+ * - node 2 completes startup fencing
+ * - 2 gets confchg for partition
+ * - 2 adds victim 1 (due to partition)
+ * - 2 gets confchg for merge
+ * - 2 does join for 1 (due to merge), begins start cycle
+ * - start cycle adding node 1 finishes, 2 sends complete
+ * - 2 reduces victim 1
+ * - 1 receives complete for its join start cycle,
+ * and clears victim 2 because we've set init_victim here
+ */
+
+ if (!fd->started_count) {
+ log_debug("add_victims node %d set init_victim",
+ node->nodeid);
+ node->init_victim = 1;
+ }
}
}
cluster: RHEL6 - rgmanager: Retry when config is out of sync
by Lon Hohberger
Gitweb: http://git.fedorahosted.org/git/cluster.git?p=cluster.git;a=commitdiff;h=...
Commit: af612b16a8b565a0e3543850367e0b58a43922cd
Parent: fdcec853307c5d9ce0517cb0088de2e970f76ae7
Author: Lon Hohberger <lhh(a)redhat.com>
AuthorDate: Thu Aug 5 16:53:22 2010 -0400
Committer: Lon Hohberger <lhh(a)redhat.com>
CommitterDate: Thu Mar 1 14:15:52 2012 -0500
rgmanager: Retry when config is out of sync
If you add a service to rgmanager v1 or v2 and that
service fails to start on the first node but succeeds
in its initial stop operation, there is a chance that
the remote instance of rgmanager has not yet reread
the configuration, causing the service to be placed
into the 'recovering' state without further action.
This patch causes the originator of the request to
retry the operation.
Later versions of rgmanager (e.g. the STABLE3 branch and
derivatives) are unlikely to have this problem, since
configuration updates are not polled, but rather
delivered to clients.
Update 22-Feb-2012: The above is incorrect; this was
reproduced on an rgmanager v3 installation.
Resolves: rhbz#796272
Signed-off-by: Lon Hohberger <lhh(a)redhat.com>
Reviewed-by: Fabio M. Di Nitto <fdinitto(a)redhat.com>
---
rgmanager/src/daemons/rg_state.c | 19 +++++++++++++++++++
1 files changed, 19 insertions(+), 0 deletions(-)
diff --git a/rgmanager/src/daemons/rg_state.c b/rgmanager/src/daemons/rg_state.c
index 23a4bec..8c5af5b 100644
--- a/rgmanager/src/daemons/rg_state.c
+++ b/rgmanager/src/daemons/rg_state.c
@@ -1801,6 +1801,7 @@ handle_relocate_req(char *svcName, int orig_request, int preferred_target,
rg_state_t svcStatus;
int target = preferred_target, me = my_id();
int ret, x, request = orig_request;
+ int retries;
get_rg_state_local(svcName, &svcStatus);
if (svcStatus.rs_state == RG_STATE_DISABLED ||
@@ -1933,6 +1934,8 @@ handle_relocate_req(char *svcName, int orig_request, int preferred_target,
if (target == me)
goto exhausted;
+ retries = 0;
+retry:
ret = svc_start_remote(svcName, request, target);
switch (ret) {
case RG_ERUN:
@@ -1942,6 +1945,22 @@ handle_relocate_req(char *svcName, int orig_request, int preferred_target,
*new_owner = svcStatus.rs_owner;
free_member_list(allowed_nodes);
return 0;
+ case RG_ENOSERVICE:
+ /*
+ * Configuration update pending on remote node? Give it
+ * a few seconds to sync up. rhbz#568126
+ *
+ * Configuration updates are synchronized in later releases
+ * of rgmanager; this should not be needed.
+ */
+ if (retries++ < 4) {
+ sleep(3);
+ goto retry;
+ }
+ logt_print(LOG_WARNING, "Member #%d has a different "
+ "configuration than I do; trying next "
+ "member.", target);
+ /* Deliberate */
case RG_EDEPEND:
case RG_EFAIL:
/* Uh oh - we failed to relocate to this node.
cluster: STABLE32 - rgmanager: Small bug in follow-service.sl
by Lon Hohberger
Gitweb: http://git.fedorahosted.org/git/cluster.git?p=cluster.git;a=commitdiff;h=...
Commit: c7d3938a3856f9cb295dc6aed8b7f86762cbed7c
Parent: b2012d8fe8b6a30f16091a8c96b5665e34892160
Author: Marc Grimme <grimme(a)atix.de>
AuthorDate: Thu Mar 1 14:06:01 2012 -0500
Committer: Lon Hohberger <lhh(a)redhat.com>
CommitterDate: Thu Mar 1 14:06:01 2012 -0500
rgmanager: Small bug in follow-service.sl
Follow-service was written for use with failover
domains.
When using follow-service without a failover domain,
the available nodelist would be nil.
This patch resolves that issue.
Signed-off-by: Lon Hohberger <lhh(a)redhat.com>
---
rgmanager/src/resources/follow-service.sl | 10 ++++++++--
1 files changed, 8 insertions(+), 2 deletions(-)
diff --git a/rgmanager/src/resources/follow-service.sl b/rgmanager/src/resources/follow-service.sl
index 4c711ec..6c17160 100644
--- a/rgmanager/src/resources/follow-service.sl
+++ b/rgmanager/src/resources/follow-service.sl
@@ -6,7 +6,7 @@
% Author: Marc Grimme, Mark Hlawatschek, October 2008
% Support: support(a)atix.de
% License: GNU General Public License (GPL), version 2 or later
-% Copyright: (c) 2008-2010 ATIX AG
+% Copyright: (c) 2008-2012 ATIX AG
debug("*** follow-service.sl");
@@ -21,7 +21,13 @@ define nodelist_online(service_name) {
(nofailback, restricted, ordered, node_list) = service_domain_info(service_name);
- return intersection(nodes, node_list);
+ if ((node_list == NULL) or (node_list == 0)) {
+ debug("service ",service_name, " has no failover domain. Taking all available nodes: ", nodes);
+ return nodes;
+ } else {
+ debug("service ",service_name, " has a failover domain. Taking intersection with available nodes: ", nodes, " => ", node_list);
+ return intersection(nodes, node_list);
+ }
}
%