Re: [PATCH 3/3] Resolve an abnormal exit when consensus timeout expired

On 11/26/2011 02:42 AM, Yunkai Zhang wrote:
> According to testing over the past weeks, this patch resolves an
> abnormal exit when corosync's consensus timeout expires.
> 
> One cause of this issue is that *multicast* messages are slower
> than *unicast* messages on the network. As a result, under some
> harsh network conditions corosync receives the orf_token but not
> some of the MCAST messages.
> 
> The issue is hard to reproduce, but we cannot avoid it completely.
> 
> Most importantly, this patch works. It makes a corosync instance
> that has run into the FAILED TO RECEIVE state form a single-node
> ring instead of exiting because of the assert.
> 
> After applying this patch, I observed the whole process:
> one corosync instance separates from the cluster => it forms a
> single-node ring => it rejoins the cluster with the other nodes.
> 
> Although we haven't found the root cause, this patch gives
> corosync a chance to rejoin the cluster.
> 
> Signed-off-by: Yunkai Zhang <qiushu.zyk@xxxxxxxxxx>
> ---
>  exec/totemsrp.c |    2 +-
>  1 files changed, 1 insertions(+), 1 deletions(-)
> 
> diff --git a/exec/totemsrp.c b/exec/totemsrp.c
> index db4e3bb..e4ad02f 100644
> --- a/exec/totemsrp.c
> +++ b/exec/totemsrp.c
> @@ -1222,7 +1222,7 @@ static int memb_consensus_agreed (
>  			break;
>  		}
>  	}
> -	assert (token_memb_entries >= 1);
> +	assert (token_memb_entries >= 0);
>  
>  	return (agreed);
>  }

This workaround has been proposed several times, but I'm not satisfied
with this solution.  token_memb_entries should never be zero (at least
the local processor should be in the membership list).  I'd rather get
to the root cause of this problem than hack around it.
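
To illustrate the invariant, here is a minimal sketch.  It is not the
code in totemsrp.c: node_id, set_subtract, invariant_holds and
patched_check_holds are hypothetical names, and the membership lists are
reduced to plain arrays.  It assumes the entry count is what remains
after the failed processors are subtracted from the full list.

```c
#include <assert.h>

/* Hypothetical stand-in for a processor address; not corosync's type. */
struct node_id {
	unsigned int id;
};

/*
 * Simplified set subtraction: copy into out every entry of full that
 * does not appear in failed, and return the number of entries kept.
 */
static int set_subtract (
	struct node_id *out,
	const struct node_id *full, int full_len,
	const struct node_id *failed, int failed_len)
{
	int kept = 0;
	int i;
	int j;

	for (i = 0; i < full_len; i++) {
		int is_failed = 0;

		for (j = 0; j < failed_len; j++) {
			if (full[i].id == failed[j].id) {
				is_failed = 1;
				break;
			}
		}
		if (!is_failed) {
			out[kept++] = full[i];
		}
	}
	return (kept);
}

/* Original check: at least the local processor must remain. */
static int invariant_holds (int token_memb_entries)
{
	return (token_memb_entries >= 1);
}

/* Patched check: accepts an empty list as well. */
static int patched_check_holds (int token_memb_entries)
{
	return (token_memb_entries >= 0);
}
```

In this sketch the count starts at zero and only grows, so the patched
check can never fail.  An empty list, which would mean the local
processor is missing from its own membership, passes silently.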

Note the BZ where this is discussed:

https://bugzilla.redhat.com/show_bug.cgi?id=671575

There is a good test case there that reliably reproduces the problem if
you want to take a look.

Regards
-steve



