According to testing over the past weeks, this patch resolves an
abnormal exit that occurs when corosync's consensus timeout expires.

One cause of this issue is that *multicast* messages travel more slowly
than *unicast* messages on the network. In harsh network conditions,
corosync may then receive the orf_token but miss some MCAST messages.
The issue is hard to reproduce, but it cannot be avoided completely.

Most importantly, the patch works: a corosync instance that has entered
the FAILED TO RECEIVE state now forms a single-node ring instead of
exiting on a failed assert.

With the patch applied, I observed the whole process:
one corosync instance separated from the cluster =>
it formed a single-node ring =>
it rejoined the cluster with the other nodes.

Although we have not found the root cause, this patch gives corosync a
chance to rejoin the cluster.

Signed-off-by: Yunkai Zhang <qiushu.zyk@xxxxxxxxxx>
---
 exec/totemsrp.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/exec/totemsrp.c b/exec/totemsrp.c
index db4e3bb..e4ad02f 100644
--- a/exec/totemsrp.c
+++ b/exec/totemsrp.c
@@ -1222,7 +1222,7 @@ static int memb_consensus_agreed (
 			break;
 		}
 	}
-	assert (token_memb_entries >= 1);
+	assert (token_memb_entries >= 0);
 
 	return (agreed);
 }
-- 
1.7.7.3
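
For readers without the source at hand, the effect of the relaxed assertion can be illustrated with a small standalone model. This is only a sketch, not the code in exec/totemsrp.c: the struct, the sketch_* helpers, the integer node ids and SKETCH_MAX_NODES are all hypothetical, and it assumes that the candidate membership is computed as "known processors minus failed processors" and can come out empty, which is the case the old `>= 1` assertion aborted on.

```c
#include <assert.h>

#define SKETCH_MAX_NODES 16 /* hypothetical limit, not corosync's */

/* Hypothetical, simplified view of one node's membership state. */
struct sketch_state {
	int proc_list[SKETCH_MAX_NODES];   /* ids of known processors */
	int proc_entries;
	int failed_list[SKETCH_MAX_NODES]; /* ids considered failed */
	int failed_entries;
	int consensus[SKETCH_MAX_NODES];   /* ids that have agreed */
	int consensus_entries;
};

/* Return 1 if id is present in set, 0 otherwise. */
static int sketch_contains(const int *set, int entries, int id)
{
	int i;

	for (i = 0; i < entries; i++) {
		if (set[i] == id) {
			return 1;
		}
	}
	return 0;
}

/* Store proc_list minus failed_list in out; return the entry count. */
static int sketch_subtract(const struct sketch_state *s, int *out)
{
	int i;
	int n = 0;

	for (i = 0; i < s->proc_entries; i++) {
		if (!sketch_contains(s->failed_list, s->failed_entries,
		    s->proc_list[i])) {
			out[n++] = s->proc_list[i];
		}
	}
	return n;
}

/* Model of the consensus check with the relaxed assertion. */
static int sketch_consensus_agreed(const struct sketch_state *s)
{
	int token_memb[SKETCH_MAX_NODES];
	int token_memb_entries = sketch_subtract(s, token_memb);
	int agreed = 1;
	int i;

	for (i = 0; i < token_memb_entries; i++) {
		if (!sketch_contains(s->consensus, s->consensus_entries,
		    token_memb[i])) {
			agreed = 0;
			break;
		}
	}
	/* Relaxed as in the patch: an empty set no longer aborts. */
	assert(token_memb_entries >= 0);

	return agreed;
}
```

In this model, an empty candidate set skips the loop entirely, so the function returns 1 instead of aborting; with a non-empty set it returns 1 only once every remaining member appears in the consensus list.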