On 06/07/2012 02:04 AM, Jan Friesse wrote:
> Jerome,
> I believe the first and second behaviors are the same as described by
> Andrew in https://bugzilla.redhat.com/show_bug.cgi?id=820821. I'm not
> yet entirely sure WHY it is happening.
>
> The third one, flushing, is very important. Without the flush, the
> buffer may start to overload, and that causes really bad behavior
> (there was a BZ with this problem).
>
> I would like Steve to review your patch, but to me it looks OK.
>
> Regards,
>   Honza

I looked at the patch, and it should be fine. Unfortunately, as I was in
the process of applying it, the email client ate the message. Honza, if
you still have a copy, can you merge that patch?

Consider it Reviewed-by: Steven Dake <sdake@xxxxxxxxxx>

> Jerome FLESCH napsal(a):
>> Hello,
>>
>> When upgrading from Corosync 1.2.8 to Corosync 1.4.2/1.4.3, some nasty
>> bugs appeared on our clusters. I observed the following bad behaviors:
>> 1) A process connected to Corosync with CPG wasn't correctly informed
>>    that there were other processes connected on other processors. It
>>    also didn't get their messages.
>> 2) A process sending messages with CPG never received copies of its
>>    own messages.
>> 3) One ring out of two went up and down quite often.
>>
>> Behaviors 1 and 2 are very hard for us to reproduce, but we can
>> trigger behavior 3 quite easily.
>>
>> The simplest setup we found to reproduce it is the following:
>> - 2 VirtualBox VMs, connected by 2 network interfaces (vboxnet0 and
>>   vboxnet1; one for each ring)
>> - OS: Linux (Debian stable)
>> - On one of the VMs, a test program sending some CPG messages (see
>>   the script "test_corosync.sh" attached to this mail for an example)
>>
>> Here are the Corosync logs we get with this setup:
>>
>> Jun 06 16:23:40 corosync [TOTEM ] A processor joined or left the
>>   membership and a new membership was formed.
>> Jun 06 16:23:40 corosync [CPG ] chosen downlist: sender r(0)
>>   ip(192.168.56.104) r(1) ip(192.168.57.104) ; members(old:1 left:0)
>> Jun 06 16:23:40 corosync [MAIN ] Completed service synchronization,
>>   ready to provide service.
>> Jun 06 16:24:37 corosync [TOTEM ] Marking ringid 1 interface
>>   192.168.57.105 FAULTY
>> Jun 06 16:24:38 corosync [TOTEM ] Automatically recovered ring 1
>> Jun 06 16:25:33 corosync [TOTEM ] Marking ringid 1 interface
>>   192.168.57.105 FAULTY
>> Jun 06 16:25:34 corosync [TOTEM ] Automatically recovered ring 1
>> Jun 06 16:26:35 corosync [TOTEM ] Marking ringid 1 interface
>>   192.168.57.105 FAULTY
>> Jun 06 16:26:36 corosync [TOTEM ] Automatically recovered ring 1
>> (...)
>>
>> The second ring goes down about every 2 minutes and automatically
>> comes back up right after.
>>
>> We spent some time looking for the commit that introduced this bug,
>> and it appears it is due to the following one:
>> Corosync 1.3.3 -> 1.3.4: e27a58d93d0d3795beb550f87b660c9c04f11386
>> Corosync 1.4.1 -> 1.4.2: be608c050247e5f9c8266b8a0f9803cc0a3dc881
>> Commit message: Ignore memb_join messages during flush operations
>>
>> I had a look at this commit, and it seems to me that it drops too many
>> packets: while totemrrp_recv_flush() is running, Corosync drops not
>> only memb_join packets but also ORF tokens. In the end, it seems that
>> sometimes we drop so many of them that Corosync marks the ring as
>> faulty.
>>
>> To fix that, I've made the patch attached to this mail
>> (corosync-fix-token-drop.patch).
>>
>> However, I wonder why this packet dropping is done at such a low
>> layer. Wouldn't it be more appropriate to do it in totemsrp.c?
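
As a side note for anyone skimming the thread, the intent of the fix can
be sketched roughly as below. This is a standalone illustration, not the
attached corosync-fix-token-drop.patch; the message-type values and the
"first byte is the message type" detail are assumptions taken from my
reading of totemsrp.c, not from the patch itself.

/*
 * While a flush is in progress, drop only memb_join messages instead of
 * every incoming packet, so ORF tokens keep flowing and the ring is not
 * marked FAULTY.
 */
enum message_type {
	MESSAGE_TYPE_ORF_TOKEN = 0,
	MESSAGE_TYPE_MCAST = 1,
	MESSAGE_TYPE_MEMB_MERGE_DETECT = 2,
	MESSAGE_TYPE_MEMB_JOIN = 3,
	MESSAGE_TYPE_MEMB_COMMIT_TOKEN = 4,
	MESSAGE_TYPE_TOKEN_HOLD_CANCEL = 5
};

/* In totemsrp, the message type is the first byte of every packet. */
static int message_type_of(const void *msg)
{
	return *(const char *)msg;
}

/*
 * Old behavior: drop every packet received while flushing, including
 * ORF tokens.  Fixed behavior: drop only memb_join, which is the
 * message the original commit was actually trying to suppress.
 */
static int should_drop_during_flush(const void *msg, int flushing)
{
	if (flushing == 0) {
		return 0;
	}
	return (message_type_of(msg) == MESSAGE_TYPE_MEMB_JOIN);
}
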
>>
>> Moreover, it seems to me that totemrrp_recv_flush() is called every
>> time Corosync gets an ORF token (in message_handler_orf_token()).
>> That seems weird to me, because the commit message says the packets
>> should only be dropped while we are in the gather state, to avoid
>> switching suddenly to the recovery state.
>>
>> Also, could you tell me whether this packet dropping could explain
>> the two other behaviors I observed?
>>
>> Thanks in advance,
>>
>> Regards,
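
For completeness, a minimal CPG test client of the kind that
test_corosync.sh presumably drives looks roughly like the following.
This is a generic sketch against the public CPG API, not the attached
script; the group name, payload, timings, and build line are assumptions.

/*
 * cpg_test.c - join a CPG group, multicast a message once per second,
 * and print configuration changes and delivered messages.
 *
 * Build (assumption): gcc -o cpg_test cpg_test.c -lcpg
 */
#include <stdio.h>
#include <string.h>
#include <stdint.h>
#include <unistd.h>
#include <sys/uio.h>
#include <corosync/corotypes.h>
#include <corosync/cpg.h>

static void deliver_cb(cpg_handle_t handle, const struct cpg_name *group,
	uint32_t nodeid, uint32_t pid, void *msg, size_t msg_len)
{
	/* Behavior 2 from the report: we expect our own messages here too. */
	printf("delivered from nodeid=%u pid=%u: %.*s\n",
	       nodeid, pid, (int)msg_len, (char *)msg);
}

static void confchg_cb(cpg_handle_t handle, const struct cpg_name *group,
	const struct cpg_address *members, size_t n_members,
	const struct cpg_address *left, size_t n_left,
	const struct cpg_address *joined, size_t n_joined)
{
	/* Behavior 1 from the report: processes connected on other
	 * processors should show up in the member list here. */
	printf("confchg: members=%zu joined=%zu left=%zu\n",
	       n_members, n_joined, n_left);
}

int main(void)
{
	cpg_callbacks_t callbacks = {
		.cpg_deliver_fn = deliver_cb,
		.cpg_confchg_fn = confchg_cb,
	};
	cpg_handle_t handle;
	struct cpg_name group;
	struct iovec iov;
	char payload[] = "hello from cpg test";
	int i;

	if (cpg_initialize(&handle, &callbacks) != CS_OK) {
		fprintf(stderr, "cpg_initialize failed\n");
		return 1;
	}

	strcpy(group.value, "test_group");	/* made-up group name */
	group.length = strlen(group.value);

	if (cpg_join(handle, &group) != CS_OK) {
		fprintf(stderr, "cpg_join failed\n");
		return 1;
	}

	iov.iov_base = payload;
	iov.iov_len = sizeof(payload);

	/* Send one message per second; pump deliveries in between. */
	for (i = 0; i < 60; i++) {
		cpg_mcast_joined(handle, CPG_TYPE_AGREED, &iov, 1);
		cpg_dispatch(handle, CS_DISPATCH_ALL);
		sleep(1);
	}

	cpg_leave(handle, &group);
	cpg_finalize(handle);
	return 0;
}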