Re: issue about two nodes could not be merged into one ring

Hi Steven,

On November 7, 2013, at 0:36, Steven Dake <sdake@xxxxxxxxxx> wrote:

Jason,

Nice dig into the code/totem.  Hope you didn't break the bank on Red Bull :)  I have a few comments inline:
Well, at least better than the guy from Crystal Lake ;).

On 11/06/2013 07:16 AM, jason wrote:
Hi All,
I recently encountered a problem where two nodes could not be merged into one ring.
Initially there were three nodes in a ring, say A, B and C. Then, after killing C, I found that A and B could never be merged (I waited at least 4 hours) unless I restarted at least one of them.
By analyzing the blackbox log, I found that both A and B were stuck in a loop doing the following:
1. Form a single node ring.
2. The ring is broken by a JOIN message from peer.
3. Try to form a two-node ring but consensus timeout.
4. Go to 1.

I checked the network with omping and it was OK. Also, I used the default corosync.conf.example, and the corosync version is 1.4.6.

To analyze more deeply, I used tcpdump to capture the messages exchanged between the two nodes, and found the following strange things:
1. Every 50ms (I think it is the join timeout):
    Node A sends a join message with proclist:A,B,C. faillist:B.
    Node B sends a join message with proclist:A,B,C. faillist:A.

2. Every 1250ms (consensus timeout):
    Node A sends a join message with proclist:A,B,C. faillist:B,C.
    Node B sends a join message with proclist:A,B,C. faillist:A,C.


Something is missing from your tcpdump analysis.  Once the consensus times out, consensus will be met:

Node A will calculate consensus based upon proclist - faillist = {A}; A has received join messages from everyone in its consensus list, hence consensus is met.

Node B will calculate consensus based upon proclist - faillist = {B}; B has received join messages from everyone in its consensus list, hence consensus is met.

What I would expect from step 3 is
after 1250ms:
node A will send join message with proclist: A, B, C.  faillist: B,C
Node B will send join message with proclist A, B, C.  faillist: A, C.

Further join messages will contain these sets.  This should lead to

Node A forming a singleton configuration because consensus is agreed
Node B forming a singleton configuration because consensus is agreed

Node A sends merge detect
Node A enters gather and sends join with proclist: A, faillist: empty

Node B sends merge detect
Node B enters gather and sends join with proclist: B, faillist: empty
In the tcpdump result, I could find neither the merge detect message nor a join message like the above. Maybe the singleton configuration has no chance to send them out before it is broken by join messages from the peer that name it in their fail list.

Node A, B receive proclist from A, B, both enter consensus and form a new ring A, B

You said C was killed.  This leads to the natural question of why it is still in the proc list after each node forms a singleton.

In the tcpdump result, I also could not find any join message whose proclist omits node C. Per my assumption below, that may be because the proclist is always copied forward from the state at the time C was killed.


It seems to be because A and B each treated the other as failed, so a two-node ring could never be formed, and the single-node ring is always broken by join messages.

I am not sure why both A and B set each other as failed in their join messages. From analyzing the code, the most likely cause is a network partition. So I made the following assumption about what happened:

1. Initially, ring(A,B,C).
2. A and B suffered a network partition and, at the same time, C went down.
3. Node A sends join message with proclist:A,B,C. faillist:NULL. Node B sends join message with proclist:A,B,C. faillist:NULL.
4. Both A and B consensus timeout due to network partition.
5. A and B network remerged.
6. Node A sends join message with proclist:A,B,C. faillist:B,C. and create ring(A). Node B sends join message with proclist:A,B,C. faillist:A,C. and create ring(B).
7. Say the join message with proclist:A,B,C. faillist:A,C sent by node B is received by node A after the network remerged.
8. Node A shifts to gather state and sends out a modified join message with proclist:A,B,C. faillist:B. Such a join message will prevent A and B from merging.
9. Node A consensus timeout (caused by waiting node C) and sends join message with proclist:A,B,C. faillist:B,C again.


good analysis

The same thing happens on node B, so A and B loop forever through steps 7, 8 and 9.

If my assumption and analysis are right, then I think step 8 is where the wrong thing happens. According to the paper I found at http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.52.4028&rep=rep1&type=pdf , "if a processor receives a join message in the operational state and if the receiver’s identifier is in the join message’s fail list, … then it ignores the join message."

Figure 4.4 doesn't match the text.  I've found in these cases in academic papers, the text takes precedence.

So I created a patch applying the above rule to try to solve the problem:

--- ./corosync-1.4.6-orig/exec/totemsrp.c Wed May 29 14:33:27 2013 UTC
+++ ./corosync-1.4.6/exec/totemsrp.c Wed Nov 6 13:12:30 2013 UTC
@@ -4274,6 +4274,36 @@
  srp_addr_copy_endian_convert (&out->system_from, &in->system_from);
 }
 
+static int ignore_join_under_operational (
+ struct totemsrp_instance *instance,
+ const struct memb_join *memb_join)
+{
+ struct srp_addr *proc_list;
+ struct srp_addr *failed_list;
+ unsigned long long ring_seq;
+
+ proc_list = (struct srp_addr *)memb_join->end_of_memb_join;
+ failed_list = proc_list + memb_join->proc_list_entries;
+ ring_seq = memb_join->ring_seq;
+
+ if (memb_set_subset (&instance->my_id, 1,
+ failed_list, memb_join->failed_list_entries)) {
+ return 1;
+ }
+
+ /* In operational state, my_proc_list is exactly the same as 
+   my_memb_list. */
+
what is the point of the below code?
It is also from the text of the paper; I just brought the two rules together. The paper also says: "If a processor receives a join message in the operational state and if the sender's identifier is in the receiver's my_proc_list and the join message's ring_seq is less than the receiver's ring sequence number, then it ignores the join message too."

+ if ((memb_set_subset (&memb_join->system_from, 1,
+ instance->my_memb_list,
+ instance->my_memb_entries)) &&
+ (ring_seq < instance->my_ring_id.seq)) {
+ return 1;
+ }
+
+ return 0;
+}
+
 static int message_handler_memb_join (
  struct totemsrp_instance *instance,
  const void *msg,
@@ -4304,7 +4334,9 @@
  }
  switch (instance->memb_state) {
  case MEMB_STATE_OPERATIONAL:
- memb_join_process (instance, memb_join);

Minor style point: `if (ignore_join_under_operational(instance, memb_join) == 0) {` would be the more conventional ordering for the condition.

+ if (0 == ignore_join_under_operational(instance, memb_join)) {
+ memb_join_process (instance, memb_join);
+ }
  break;
 
  case MEMB_STATE_GATHER:

Currently, I haven’t reproduced the problem in a 3-node cluster, but I have reproduced the “a processor receives a join message in the operational state and the receiver’s identifier is in the join message’s fail list” circumstance in a two-node environment, using the following steps:
1. iptables -A INPUT -i eth0 -p udp ! --sport domain -j DROP
2. usleep 2126000
3. iptables -D INPUT -i eth0 -p udp ! --sport domain -j DROP

In the two-node environment there is no dead-loop issue as in the 3-node one, because there is no consensus timeout caused by the third, dead node in step 9. But it can still be used to prove the patch.

Please take a look at this issue, Thanks!


Please use git send-email to send the email.  It allows an easier merging of the patch and attribution of the work.

Thanks, I will resend this patch as soon as I get familiar with git.
  
Regards
-steve


--
Yours,
Jason


_______________________________________________
discuss mailing list
discuss@xxxxxxxxxxxx
http://lists.corosync.org/mailman/listinfo/discuss

