Re: [PATCH 3/3] Resolve an abnormal exit when consensus timeout expired

On 11/26/2011 09:22 AM, Steven Dake wrote:
> On 11/26/2011 02:42 AM, Yunkai Zhang wrote:
>> According to testing over the past few weeks, this patch resolves an
>> abnormal exit when corosync hits the consensus timeout.
>>
>> One reason for this issue is that *multicast* messages are slower
>> than *unicast* messages on the network. In a harsh network situation
>> this means corosync can receive the orf_token but fail to receive
>> some MCAST messages.
>>
>> This issue is hard to reproduce, but we cannot avoid it completely.
>>
>> Most importantly, this patch works. It makes a corosync node that
>> has run into the FAILED TO RECEIVE state form a single-node ring
>> instead of exiting through the assert.
>>
>> After applying this patch, I have observed the whole process:
>> one corosync node separates from the cluster => it forms a
>> single-node ring => it rejoins the cluster with the other nodes.
>>
>> Although we haven't found the root cause, this patch gives
>> corosync a chance to rejoin the cluster.
>> Signed-off-by: Yunkai Zhang <qiushu.zyk@xxxxxxxxxx>
>> ---
>>  exec/totemsrp.c |    2 +-
>>  1 files changed, 1 insertions(+), 1 deletions(-)
>>
>> diff --git a/exec/totemsrp.c b/exec/totemsrp.c
>> index db4e3bb..e4ad02f 100644
>> --- a/exec/totemsrp.c
>> +++ b/exec/totemsrp.c
>> @@ -1222,7 +1222,7 @@ static int memb_consensus_agreed (
>>  			break;
>>  		}
>>  	}
>> -	assert (token_memb_entries >= 1);
>> +	assert (token_memb_entries >= 0);
>>  
>>  	return (agreed);
>>  }
> 
> This workaround has been proposed several times, but I'm not satisfied
> with this solution.  token_memb_entries should never be zero (at least
> the local processor should be in the membership list).  I'd rather get
> to the root cause of this problem than hack around it.
> 
> Note the BZ where this is discussed:
> 
> https://bugzilla.redhat.com/show_bug.cgi?id=671575
> 
> There is a good test case there that reliably reproduces the problem if
> you want to take a look.
> 
> Regards
> -steve
> 
> 
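For context on the assert being relaxed above, here is a minimal illustrative
sketch (simplified and hypothetical, not the actual totemsrp code) of the kind
of consensus count that assert guards.  The point is that the local processor
always marks consensus with itself, so a count of zero signals the
lost-multicast failure mode rather than a legitimate membership state:

/*
 * Illustrative sketch only; simplified and hypothetical, not the actual
 * totemsrp code.  The local node always counts itself, so zero consensus
 * entries should be impossible under normal operation.
 */
#include <assert.h>

struct consensus_entry {
	unsigned int node_id;   /* member identity, simplified to an id */
	int set;                /* 1 once consensus is reached with this member */
};

static int consensus_agreed (const struct consensus_entry *list, int entries)
{
	int token_memb_entries = 0;
	int agreed = 1;
	int i;

	for (i = 0; i < entries; i++) {
		if (list[i].set) {
			token_memb_entries += 1;
		} else {
			agreed = 0;     /* at least one member has not agreed */
		}
	}

	/*
	 * Expected to hold because the local node counts itself; relaxing
	 * it to >= 0, as the patch does, lets a node that lost all
	 * multicast traffic form a single-node ring instead of aborting.
	 */
	assert (token_memb_entries >= 1);

	return (agreed);
}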

Thinking more about this, and specifically your analysis that the cause
of the problem is multicast messages being delayed relative to unicast
messages, there might be an interesting way to solve this.  We worked
around delayed multicast by keeping a list in the sort queue and
increasing it by one on each node whenever a message was missed.  That
might also be a good location to detect a failure to receive multicast
(with the action of doing something about it located in the orf token
handler, instead of both detection and recovery being handled there).
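
To make that concrete, here is a rough sketch of the detection idea
(hypothetical, not corosync code): count how many consecutive token rotations
the same sort-queue gap has persisted, and report a failure to receive once it
crosses a limit, leaving the recovery action itself to the orf token handler:

/*
 * Rough sketch of the detection idea only; hypothetical, not corosync
 * code.  The threshold and field names are made up for illustration.
 */
#include <stdbool.h>

#define MISSED_ROTATIONS_LIMIT 50       /* hypothetical threshold */

struct recv_tracker {
	unsigned int lowest_missing;    /* lowest multicast seq not yet received */
	unsigned int missed_rotations;  /* rotations that gap has persisted */
};

/* Call once per token rotation with the current lowest missing sequence. */
static bool fail_to_receive_detected (struct recv_tracker *t,
	unsigned int lowest_missing_now)
{
	if (lowest_missing_now == t->lowest_missing) {
		t->missed_rotations += 1;       /* same gap, still unfilled */
	} else {
		t->lowest_missing = lowest_missing_now;
		t->missed_rotations = 0;        /* progress was made */
	}

	return (t->missed_rotations >= MISSED_ROTATIONS_LIMIT);
}

The attraction is that detection piggybacks on bookkeeping the sort queue
already does, while the decision about what to do (for example, treating it
like the FAILED TO RECEIVE path) stays with the token handler.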

I'll consider it more.

Regards
-steve

_______________________________________________
discuss mailing list
discuss@xxxxxxxxxxxx
http://lists.corosync.org/mailman/listinfo/discuss

