[PATCH 3/3] Resolve an abnormal exit when the consensus timeout expires

Yunkai Zhang <qiushu.zyk@xxxxxxxxxx> · Sat, 26 Nov 2011 17:42:55 +0800

According to testing over the past weeks, this patch resolves an
abnormal exit that occurs when corosync's consensus timeout expires.

One cause of this issue is that *multicast* messages are slower than
*unicast* messages on the network. Under harsh network conditions,
corosync may receive the orf_token but fail to receive some MCAST
messages.

This issue is hard to reproduce, but it cannot be avoided completely.

Most importantly, this patch works. A corosync node that has run into
the FAILED TO RECEIVE state now forms a single-node ring instead of
exiting because of a failed assert.
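
To illustrate the effect of relaxing the assert, here is a minimal
standalone sketch. It is not corosync code: count_token_members(),
old_check_passes() and new_check_passes() are hypothetical helpers, and
it assumes token_memb_entries counts the processors that remain after
the failed ones are excluded (this patch does not show how the value is
computed).

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper (not part of corosync): count the entries of
 * proc_list that do not appear in failed_list. This stands in for
 * however token_memb_entries is really computed.
 */
int count_token_members(const unsigned int *proc_list, size_t proc_entries,
			const unsigned int *failed_list, size_t failed_entries)
{
	int count = 0;

	for (size_t i = 0; i < proc_entries; i++) {
		int failed = 0;

		for (size_t j = 0; j < failed_entries; j++) {
			if (proc_list[i] == failed_list[j]) {
				failed = 1;
				break;
			}
		}
		if (!failed) {
			count++;
		}
	}
	return count;
}

/* Condition checked before this patch: at least one member must remain. */
int old_check_passes(int token_memb_entries)
{
	return token_memb_entries >= 1;
}

/* Condition checked after this patch: an empty member set is tolerated. */
int new_check_passes(int token_memb_entries)
{
	return token_memb_entries >= 0;
}
```

If every processor in the list is also in the failed list, the count is
0: the old condition is false, so the assert would abort the process,
while the new condition holds and the function can return normally.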

After applying this patch, I observed the whole process:
one corosync node separated from the cluster => it formed a
single-node ring => it rejoined the cluster with the other nodes.

Although we haven't found the root cause, this patch gives corosync
a chance to rejoin the cluster.

Signed-off-by: Yunkai Zhang <qiushu.zyk@xxxxxxxxxx>
---
 exec/totemsrp.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/exec/totemsrp.c b/exec/totemsrp.c
index db4e3bb..e4ad02f 100644
--- a/exec/totemsrp.c
+++ b/exec/totemsrp.c
@@ -1222,7 +1222,7 @@ static int memb_consensus_agreed (
 			break;
 		}
 	}
-	assert (token_memb_entries >= 1);
+	assert (token_memb_entries >= 0);
 
 	return (agreed);
 }
-- 
1.7.7.3
