On 06/07/2012 02:04 AM, Jan Friesse wrote:
> Jerome,
> I believe the first and second behaviors are the same as described by
> Andrew in https://bugzilla.redhat.com/show_bug.cgi?id=820821. I'm not
> yet entirely sure WHY it is happening.
>
> The third one, flushing, is very important. Without the flush, the
> buffer may start to overload, and that causes really bad behavior
> (there was a BZ with this problem).
>
> I would like Steve to review your patch, but to me it looks OK.
>
> Regards,
>   Honza

I looked at the patch, and it should be fine. Unfortunately, as I was in
the process of applying it, the email client ate the message. Honza, if
you still have a copy, can you merge that patch?

Consider it Reviewed-by: Steven Dake <sdake@xxxxxxxxxx>

> Jerome FLESCH napsal(a):
>> Hello,
>>
>> When upgrading from Corosync 1.2.8 to Corosync 1.4.2/1.4.3, some nasty
>> bugs appeared on our clusters. I observed the following bad behaviors:
>> 1) A process connected to Corosync with CPG wasn't correctly informed
>>    that there were other processes connected on other processors. It
>>    also didn't get their messages.
>> 2) A process sending messages with CPG never received copies of its
>>    own messages.
>> 3) One ring out of two went up and down quite often.
>>
>> Behaviors 1 and 2 are very hard for us to reproduce, but we can
>> trigger behavior 3 quite easily.
>>
>> The simplest setup we found to reproduce it is the following:
>> - 2 VirtualBox VMs, connected by 2 network interfaces (vboxnet0 and
>>   vboxnet1; one for each ring)
>> - OS: Linux (Debian stable)
>> - On one of the VMs, a test program sending some CPG messages (see
>>   the script "test_corosync.sh" attached to this mail for an example)
>>
>> Here are the Corosync logs we get with this setup:
>>
>> Jun 06 16:23:40 corosync [TOTEM ] A processor joined or left the
>>   membership and a new membership was formed.
>> Jun 06 16:23:40 corosync [CPG ] chosen downlist: sender r(0)
>>   ip(192.168.56.104) r(1) ip(192.168.57.104) ; members(old:1 left:0)
>> Jun 06 16:23:40 corosync [MAIN ] Completed service synchronization,
>>   ready to provide service.
>> Jun 06 16:24:37 corosync [TOTEM ] Marking ringid 1 interface
>>   192.168.57.105 FAULTY
>> Jun 06 16:24:38 corosync [TOTEM ] Automatically recovered ring 1
>> Jun 06 16:25:33 corosync [TOTEM ] Marking ringid 1 interface
>>   192.168.57.105 FAULTY
>> Jun 06 16:25:34 corosync [TOTEM ] Automatically recovered ring 1
>> Jun 06 16:26:35 corosync [TOTEM ] Marking ringid 1 interface
>>   192.168.57.105 FAULTY
>> Jun 06 16:26:36 corosync [TOTEM ] Automatically recovered ring 1
>> (...)
>>
>> The second ring goes down about every 2 minutes and automatically
>> comes back up right after.
>>
>> We spent some time looking for the commit that introduced this bug,
>> and it appears it is due to the following one:
>> Corosync 1.3.3 -> 1.3.4: e27a58d93d0d3795beb550f87b660c9c04f11386
>> Corosync 1.4.1 -> 1.4.2: be608c050247e5f9c8266b8a0f9803cc0a3dc881
>> Commit message: Ignore memb_join messages during flush operations
>>
>> I had a look at this commit, and it seems to me that it drops too many
>> packets: while totemrrp_recv_flush() is running, Corosync drops not
>> only memb_join packets but also ORF tokens. In the end, it seems that
>> sometimes we drop so many of them that Corosync marks the ring as
>> faulty.
>>
>> To fix that, I've made the patch attached to this mail
>> (corosync-fix-token-drop.patch).
>>
>> However, I wonder why this packet dropping is done at such a low
>> layer. Wouldn't it be more appropriate to do it in totemsrp.c?
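
As a side note for anyone skimming the thread, the intent of the fix can
be sketched roughly as below. This is a standalone illustration, not the
attached corosync-fix-token-drop.patch; the message-type values and the
"first byte is the message type" detail are assumptions taken from my
reading of totemsrp.c, not from the patch itself.

/*
 * While a flush is in progress, drop only memb_join messages instead of
 * every incoming packet, so ORF tokens keep flowing and the ring is not
 * marked FAULTY.
 */
enum message_type {
	MESSAGE_TYPE_ORF_TOKEN = 0,
	MESSAGE_TYPE_MCAST = 1,
	MESSAGE_TYPE_MEMB_MERGE_DETECT = 2,
	MESSAGE_TYPE_MEMB_JOIN = 3,
	MESSAGE_TYPE_MEMB_COMMIT_TOKEN = 4,
	MESSAGE_TYPE_TOKEN_HOLD_CANCEL = 5
};

/* In totemsrp, the message type is the first byte of every packet. */
static int message_type_of(const void *msg)
{
	return *(const char *)msg;
}

/*
 * Old behavior: drop every packet received while flushing, including
 * ORF tokens.  Fixed behavior: drop only memb_join, which is the
 * message the original commit was actually trying to suppress.
 */
static int should_drop_during_flush(const void *msg, int flushing)
{
	if (flushing == 0) {
		return 0;
	}
	return (message_type_of(msg) == MESSAGE_TYPE_MEMB_JOIN);
}
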
>>
>> Moreover, it seems to me that totemrrp_recv_flush() is called every
>> time Corosync gets an ORF token (in message_handler_orf_token()).
>> That seems weird to me, because the commit message says the packets
>> should only be dropped while we are in the gather state, to avoid
>> switching suddenly to the recovery state.
>>
>> Also, could you tell me whether this packet dropping could explain
>> the two other behaviors I observed?
>>
>> Thanks in advance,
>>
>> Regards,
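
For completeness, a minimal CPG test client of the kind that
test_corosync.sh presumably drives looks roughly like the following.
This is a generic sketch against the public CPG API, not the attached
script; the group name, payload, timings, and build line are assumptions.

/*
 * cpg_test.c - join a CPG group, multicast a message once per second,
 * and print configuration changes and delivered messages.
 *
 * Build (assumption): gcc -o cpg_test cpg_test.c -lcpg
 */
#include <stdio.h>
#include <string.h>
#include <stdint.h>
#include <unistd.h>
#include <sys/uio.h>
#include <corosync/corotypes.h>
#include <corosync/cpg.h>

static void deliver_cb(cpg_handle_t handle, const struct cpg_name *group,
	uint32_t nodeid, uint32_t pid, void *msg, size_t msg_len)
{
	/* Behavior 2 from the report: we expect our own messages here too. */
	printf("delivered from nodeid=%u pid=%u: %.*s\n",
	       nodeid, pid, (int)msg_len, (char *)msg);
}

static void confchg_cb(cpg_handle_t handle, const struct cpg_name *group,
	const struct cpg_address *members, size_t n_members,
	const struct cpg_address *left, size_t n_left,
	const struct cpg_address *joined, size_t n_joined)
{
	/* Behavior 1 from the report: processes connected on other
	 * processors should show up in the member list here. */
	printf("confchg: members=%zu joined=%zu left=%zu\n",
	       n_members, n_joined, n_left);
}

int main(void)
{
	cpg_callbacks_t callbacks = {
		.cpg_deliver_fn = deliver_cb,
		.cpg_confchg_fn = confchg_cb,
	};
	cpg_handle_t handle;
	struct cpg_name group;
	struct iovec iov;
	char payload[] = "hello from cpg test";
	int i;

	if (cpg_initialize(&handle, &callbacks) != CS_OK) {
		fprintf(stderr, "cpg_initialize failed\n");
		return 1;
	}

	strcpy(group.value, "test_group");	/* made-up group name */
	group.length = strlen(group.value);

	if (cpg_join(handle, &group) != CS_OK) {
		fprintf(stderr, "cpg_join failed\n");
		return 1;
	}

	iov.iov_base = payload;
	iov.iov_len = sizeof(payload);

	/* Send one message per second; pump deliveries in between. */
	for (i = 0; i < 60; i++) {
		cpg_mcast_joined(handle, CPG_TYPE_AGREED, &iov, 1);
		cpg_dispatch(handle, CS_DISPATCH_ALL);
		sleep(1);
	}

	cpg_leave(handle, &group);
	cpg_finalize(handle);
	return 0;
}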