On Sun, Nov 27, 2011 at 6:57 AM, Steven Dake <sdake@xxxxxxxxxx> wrote:
> On 11/26/2011 09:22 AM, Steven Dake wrote:
>> On 11/26/2011 02:42 AM, Yunkai Zhang wrote:
>>> According to the testing over the past weeks, this patch can resolve an
>>> abnormal exit when corosync's consensus timeout expires.
>>>
>>> One reason for this issue is that *multicast* messages are slower
>>> than *unicast* messages on the network. As a result, corosync may
>>> receive the orf_token but fail to receive some MCAST messages under
>>> harsh network conditions.
>>>
>>> This issue is hard to reproduce, but we cannot avoid it completely.
>>>
>>> The important point is that this patch works: a corosync node that
>>> runs into the FAILED TO RECEIVE state forms a single-node ring
>>> instead of exiting on the assert.
>>>
>>> After applying this patch, I have observed the whole process:
>>> one corosync node separates from the cluster => it forms a
>>> single-node ring => it rejoins the cluster with the other nodes.
>>>
>>> Although we haven't found the root cause yet, this patch gives
>>> corosync a chance to rejoin the cluster.
>>>
>>> Signed-off-by: Yunkai Zhang <qiushu.zyk@xxxxxxxxxx>
>>> ---
>>>  exec/totemsrp.c |    2 +-
>>>  1 files changed, 1 insertions(+), 1 deletions(-)
>>>
>>> diff --git a/exec/totemsrp.c b/exec/totemsrp.c
>>> index db4e3bb..e4ad02f 100644
>>> --- a/exec/totemsrp.c
>>> +++ b/exec/totemsrp.c
>>> @@ -1222,7 +1222,7 @@ static int memb_consensus_agreed (
>>>  			break;
>>>  		}
>>>  	}
>>> -	assert (token_memb_entries >= 1);
>>> +	assert (token_memb_entries >= 0);
>>>
>>>  	return (agreed);
>>>  }
>>
>> This workaround has been proposed several times, but I'm not satisfied
>> with this solution. token_memb_entries should never be zero (at least
>> the local processor should be in the membership list). I'd rather get
>> to the root cause of this problem than hack around it.
>>
>> Note the BZ where this is discussed:
>>
>> https://bugzilla.redhat.com/show_bug.cgi?id=671575
>>
>> There is a good test case there that reproduces the problem reliably,
>> if you want to take a look.
>>
>> Regards
>> -steve
>>
>
> Thinking more about this, and specifically about your analysis that the
> cause of the problem is delayed multicast messages with respect to
> unicast messages, there might be an interesting way to solve it. We
> worked around delayed multicast by keeping a list in the sort queue and
> increasing a per-node count by one each time a message was missed. This
> might also make a good location to detect failure to receive multicast
> (with the action of doing something about it located in the orf handler,
> instead of both detection and recovery being handled in the orf token
> path).
>
> I'll consider it more.
>

We are looking forward to this solution :) Thanks.

> Regards
> -steve
>

--
Yunkai Zhang
Work at Taobao

_______________________________________________
discuss mailing list
discuss@xxxxxxxxxxxx
http://lists.corosync.org/mailman/listinfo/discuss
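
Below is a minimal, self-contained sketch of the detection idea discussed in
the thread above: keep a per-node counter of missed multicast messages
(bumped wherever the sort queue already records a gap) and let the
multicast/orf handler, rather than the token path, decide when a node has
failed to receive. Every identifier in it (recv_failure_detector,
FAIL_TO_RECV_LIMIT, the detector_* functions) is invented for illustration
and does not correspond to the actual totemsrp.c data structures or
configuration; it is only meant to make the shape of the proposal concrete.

	/*
	 * Sketch only: all names here are hypothetical, not the real
	 * totemsrp.c code.  The idea: count missed multicast messages per
	 * node and detect FAILED TO RECEIVE in the multicast path instead
	 * of asserting in memb_consensus_agreed().
	 */
	#include <stdio.h>
	#include <string.h>

	#define MAX_NODES           16   /* hypothetical node limit      */
	#define FAIL_TO_RECV_LIMIT  30   /* hypothetical threshold       */

	struct recv_failure_detector {
		unsigned int missed[MAX_NODES]; /* missed msgs per node */
	};

	static void detector_init (struct recv_failure_detector *d)
	{
		memset (d, 0, sizeof (*d));
	}

	/* Called from the place that walks the sort queue and notices a
	 * sequence-number gap for a given node. */
	static void detector_mark_missed (struct recv_failure_detector *d,
		unsigned int node_id)
	{
		if (node_id < MAX_NODES) {
			d->missed[node_id] += 1;
		}
	}

	/* Called when a multicast message from the node does arrive, so
	 * transient reordering does not accumulate forever. */
	static void detector_mark_received (struct recv_failure_detector *d,
		unsigned int node_id)
	{
		if (node_id < MAX_NODES) {
			d->missed[node_id] = 0;
		}
	}

	/* Checked from the multicast (orf) handler rather than the token
	 * handler: returns 1 if the node should be treated as FAILED TO
	 * RECEIVE, i.e. drop to a single-node ring and try to rejoin. */
	static int detector_failed_to_receive (
		const struct recv_failure_detector *d,
		unsigned int node_id)
	{
		return node_id < MAX_NODES &&
			d->missed[node_id] > FAIL_TO_RECV_LIMIT;
	}

	int main (void)
	{
		struct recv_failure_detector d;
		unsigned int i;

		detector_init (&d);

		/* Node 3 keeps missing multicast messages. */
		for (i = 0; i <= FAIL_TO_RECV_LIMIT; i++) {
			detector_mark_missed (&d, 3);
		}

		/* Node 5 misses a few but then catches up. */
		detector_mark_missed (&d, 5);
		detector_mark_missed (&d, 5);
		detector_mark_received (&d, 5);

		printf ("node 3 failed to receive: %s\n",
			detector_failed_to_receive (&d, 3) ? "yes" : "no");
		printf ("node 5 failed to receive: %s\n",
			detector_failed_to_receive (&d, 5) ? "yes" : "no");
		return 0;
	}

In the real code the threshold would presumably come from the totem
configuration rather than a compile-time constant, and the recovery action
would be forming a single-node ring and rejoining, as described in the patch
message above.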