On Sun, Nov 27, 2011 at 6:57 AM, Steven Dake <sdake@xxxxxxxxxx> wrote:
> On 11/26/2011 09:22 AM, Steven Dake wrote:
>> On 11/26/2011 02:42 AM, Yunkai Zhang wrote:
>>> According to the testing over the past weeks, this patch can resolve an
>>> abnormal exit when corosync's consensus timeout expires.
>>>
>>> One reason for this issue is that *multicast* messages are slower
>>> than *unicast* messages on the network. As a result, corosync may
>>> receive the orf_token but fail to receive some MCAST messages under
>>> harsh network conditions.
>>>
>>> This issue is hard to reproduce, but we cannot avoid it completely.
>>>
>>> The important point is that this patch works: a corosync node that
>>> runs into the FAILED TO RECEIVE state forms a single-node ring
>>> instead of exiting on the assert.
>>>
>>> After applying this patch, I have observed the whole process:
>>> one corosync node separates from the cluster => it forms a
>>> single-node ring => it rejoins the cluster with the other nodes.
>>>
>>> Although we haven't found the root cause yet, this patch gives
>>> corosync a chance to rejoin the cluster.
>>>
>>> Signed-off-by: Yunkai Zhang <qiushu.zyk@xxxxxxxxxx>
>>> ---
>>>  exec/totemsrp.c |    2 +-
>>>  1 files changed, 1 insertions(+), 1 deletions(-)
>>>
>>> diff --git a/exec/totemsrp.c b/exec/totemsrp.c
>>> index db4e3bb..e4ad02f 100644
>>> --- a/exec/totemsrp.c
>>> +++ b/exec/totemsrp.c
>>> @@ -1222,7 +1222,7 @@ static int memb_consensus_agreed (
>>>  			break;
>>>  		}
>>>  	}
>>> -	assert (token_memb_entries >= 1);
>>> +	assert (token_memb_entries >= 0);
>>>
>>>  	return (agreed);
>>>  }
>>
>> This workaround has been proposed several times, but I'm not satisfied
>> with this solution. token_memb_entries should never be zero (at least
>> the local processor should be in the membership list). I'd rather get
>> to the root cause of this problem than hack around it.
>>
>> Note the BZ where this is discussed:
>>
>> https://bugzilla.redhat.com/show_bug.cgi?id=671575
>>
>> There is a good test case there that reproduces the problem reliably,
>> if you want to take a look.
>>
>> Regards
>> -steve
>>
>
> Thinking more about this, and specifically about your analysis that the
> cause of the problem is delayed multicast messages with respect to
> unicast messages, there might be an interesting way to solve it. We
> worked around delayed multicast by keeping a list in the sort queue and
> increasing a per-node count by one each time a message was missed. This
> might also make a good location to detect failure to receive multicast
> (with the action of doing something about it located in the orf handler,
> instead of both detection and recovery being handled in the orf token
> path).
>
> I'll consider it more.
>

We are looking forward to this solution :) Thanks.

> Regards
> -steve
>

--
Yunkai Zhang
Work at Taobao

_______________________________________________
discuss mailing list
discuss@xxxxxxxxxxxx
http://lists.corosync.org/mailman/listinfo/discuss
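
Below is a minimal, self-contained sketch of the detection idea discussed in
the thread above: keep a per-node counter of missed multicast messages
(bumped wherever the sort queue already records a gap) and let the
multicast/orf handler, rather than the token path, decide when a node has
failed to receive. Every identifier in it (recv_failure_detector,
FAIL_TO_RECV_LIMIT, the detector_* functions) is invented for illustration
and does not correspond to the actual totemsrp.c data structures or
configuration; it is only meant to make the shape of the proposal concrete.

	/*
	 * Sketch only: all names here are hypothetical, not the real
	 * totemsrp.c code.  The idea: count missed multicast messages per
	 * node and detect FAILED TO RECEIVE in the multicast path instead
	 * of asserting in memb_consensus_agreed().
	 */
	#include <stdio.h>
	#include <string.h>

	#define MAX_NODES           16   /* hypothetical node limit      */
	#define FAIL_TO_RECV_LIMIT  30   /* hypothetical threshold       */

	struct recv_failure_detector {
		unsigned int missed[MAX_NODES]; /* missed msgs per node */
	};

	static void detector_init (struct recv_failure_detector *d)
	{
		memset (d, 0, sizeof (*d));
	}

	/* Called from the place that walks the sort queue and notices a
	 * sequence-number gap for a given node. */
	static void detector_mark_missed (struct recv_failure_detector *d,
		unsigned int node_id)
	{
		if (node_id < MAX_NODES) {
			d->missed[node_id] += 1;
		}
	}

	/* Called when a multicast message from the node does arrive, so
	 * transient reordering does not accumulate forever. */
	static void detector_mark_received (struct recv_failure_detector *d,
		unsigned int node_id)
	{
		if (node_id < MAX_NODES) {
			d->missed[node_id] = 0;
		}
	}

	/* Checked from the multicast (orf) handler rather than the token
	 * handler: returns 1 if the node should be treated as FAILED TO
	 * RECEIVE, i.e. drop to a single-node ring and try to rejoin. */
	static int detector_failed_to_receive (
		const struct recv_failure_detector *d,
		unsigned int node_id)
	{
		return node_id < MAX_NODES &&
			d->missed[node_id] > FAIL_TO_RECV_LIMIT;
	}

	int main (void)
	{
		struct recv_failure_detector d;
		unsigned int i;

		detector_init (&d);

		/* Node 3 keeps missing multicast messages. */
		for (i = 0; i <= FAIL_TO_RECV_LIMIT; i++) {
			detector_mark_missed (&d, 3);
		}

		/* Node 5 misses a few but then catches up. */
		detector_mark_missed (&d, 5);
		detector_mark_missed (&d, 5);
		detector_mark_received (&d, 5);

		printf ("node 3 failed to receive: %s\n",
			detector_failed_to_receive (&d, 3) ? "yes" : "no");
		printf ("node 5 failed to receive: %s\n",
			detector_failed_to_receive (&d, 5) ? "yes" : "no");
		return 0;
	}

In the real code the threshold would presumably come from the totem
configuration rather than a compile-time constant, and the recovery action
would be forming a single-node ring and rejoining, as described in the patch
message above.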