Re: issue about two nodes could not be merged into one ring

Jason,

Nice dig into the code/totem.  Hope you didn't break the bank on red bull :)  I have a few comments inline:

On 11/06/2013 07:16 AM, jason wrote:
Hi All,
I recently encountered a problem where two nodes could not be merged into one ring.
Initially there were three nodes in a ring, say A, B and C. Then, after killing C, I found that A and B could never merge again (I waited at least 4 hours) unless I restarted at least one of them.
From the blackbox log, both A and B are stuck in an endless loop doing the following (roughly sketched in code just after this list):
1. Form a single node ring.
2. The ring is broken by a JOIN message from peer.
3. Try to form a two-node ring but consensus timeout.
4. Go to 1.
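Roughly, as a toy model only (not corosync code, just the cycle above spelled out in C):

#include <stdio.h>

/* Toy model of the cycle seen in the blackbox log; not corosync code. */
enum state { SINGLE_NODE_RING, RING_BROKEN_BY_JOIN, CONSENSUS_WAIT };

int main(void)
{
        enum state s = SINGLE_NODE_RING;
        int step;

        for (step = 0; step < 6; step++) {      /* in the real cluster this never ends */
                switch (s) {
                case SINGLE_NODE_RING:
                        puts("1. form a single-node ring");
                        s = RING_BROKEN_BY_JOIN;
                        break;
                case RING_BROKEN_BY_JOIN:
                        puts("2. ring broken by a JOIN message from the peer");
                        s = CONSENSUS_WAIT;
                        break;
                case CONSENSUS_WAIT:
                        puts("3. try to form a two-node ring, consensus times out");
                        s = SINGLE_NODE_RING;   /* 4. go back to 1 */
                        break;
                }
        }
        return 0;
}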

I checked the network with omping and it was OK. I used the default corosync.conf.example, and the corosync version is 1.4.6.

To dig deeper, I captured the traffic with tcpdump to see the content of the messages exchanged between the two nodes, and found the following strange things:
1. Every 50ms (I think this is the join timeout):
    Node A sends a join message with proclist:A,B,C. faillist:B.
    Node B sends a join message with proclist:A,B,C. faillist:A.

2. Every 1250ms (the consensus timeout):
    Node A sends a join message with proclist:A,B,C. faillist:B,C.
    Node B sends a join message with proclist:A,B,C. faillist:A,C.


Something is missing from your tcpdump analysis.  Once the consensus timer expires, consensus will be met:

Node A will calculate consensus based upon proclist - faillist = {A}; A has received join messages from everyone in that consensus list, hence consensus is met.

Node B will calculate consensus based upon proclist - faillist = {B}; B has received join messages from everyone in that consensus list, hence consensus is met.
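As a standalone illustration of that set arithmetic (plain ints stand in for the srp_addr entries; this is not the totemsrp code):

#include <stdio.h>

/* Illustration only: consensus is evaluated over proclist minus faillist,
 * so with faillist = {B, C} node A only needs its own join message. */
static int in_set(int id, const int *set, int n)
{
        int i;
        for (i = 0; i < n; i++)
                if (set[i] == id)
                        return 1;
        return 0;
}

int main(void)
{
        int proclist[]   = { 'A', 'B', 'C' };
        int faillist[]   = { 'B', 'C' };        /* node A's view after the timeout */
        int joins_seen[] = { 'A' };             /* join messages node A has received */
        int i, met = 1;

        for (i = 0; i < 3; i++) {
                if (in_set(proclist[i], faillist, 2))
                        continue;               /* failed nodes are not waited for */
                if (!in_set(proclist[i], joins_seen, 1))
                        met = 0;
        }
        printf("consensus %s\n", met ? "met" : "not met");
        return 0;
}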

What I would expect from step 3 is, after 1250ms:
Node A will send a join message with proclist: A, B, C; faillist: B, C.
Node B will send a join message with proclist: A, B, C; faillist: A, C.

Further join messages will contain these sets.  This should lead to

Node A forming a singleton configuration because consensus is agreed
Node B forming a singleton configuration because consensus is agreed

Node A sends merge detect
Node A enters gather and sends join with proclist: A, faillist: empty

Node B sends merge detect
Node B enters gather and sends join with proclist: B, faillist: empty

Nodes A and B receive each other's proclists, both reach consensus and form a new ring containing A and B.
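Putting the expected and the observed join contents side by side (strings instead of srp_addr lists, purely illustrative):

#include <stdio.h>

/* Purely illustrative: real join messages carry srp_addr lists, not strings. */
struct join_msg {
        const char *proclist;
        const char *faillist;
};

int main(void)
{
        /* what I would expect node A to send after forming the singleton
         * and re-entering gather on merge detect: */
        struct join_msg expected = { "A", "(empty)" };

        /* what the tcpdump shows node A sending every 1250ms: */
        struct join_msg observed = { "A,B,C", "B,C" };

        printf("expected join: proclist=%s faillist=%s\n",
            expected.proclist, expected.faillist);
        printf("observed join: proclist=%s faillist=%s\n",
            observed.proclist, observed.faillist);
        return 0;
}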

You said C was killed.  This leads to the natural question of why it is still in the proc list after each node forms a singleton.


It should be because both A and B treat each other as failed, so they can never merge, and each single-node ring is constantly broken by the other's join messages.

I am not sure why both A and B set each other as failed in their join messages. From analyzing the code, the most likely cause is a network partition, so I made the following assumption about what happened:

1. Initially, ring(A,B,C).
2. A network partition separates A and B, and at the same time C goes down.
3. Node A sends a join message with proclist:A,B,C. faillist:NULL. Node B sends a join message with proclist:A,B,C. faillist:NULL.
4. Both A and B hit the consensus timeout because of the partition.
5. The network between A and B heals.
6. Node A sends a join message with proclist:A,B,C. faillist:B,C. and creates ring(A). Node B sends a join message with proclist:A,B,C. faillist:A,C. and creates ring(B).
7. Say the join message with proclist:A,B,C. faillist:A,C sent by node B is received by node A, now that the network has healed.
8. Node A shifts to the gather state and sends out a modified join message with proclist:A,B,C. faillist:B. Such a join message prevents A and B from merging.
9. Node A hits the consensus timeout (while waiting for node C) and sends the join message with proclist:A,B,C. faillist:B,C again.


good analysis

The same thing happens on node B, so A and B will loop forever through steps 7, 8 and 9 (illustrated just below).
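To spell out why step 9 never succeeds (again just an illustration with plain character ids, not totemsrp code): after step 8 node A's faillist is {B}, so the consensus list is proclist - faillist = {A, C}, and the dead node C never sends a join message:

#include <stdio.h>

/* Illustration of step 9: with faillist = {B}, node A waits for C,
 * but C is dead, so the consensus timer always expires. */
int main(void)
{
        const char consensus_list[] = { 'A', 'C' };     /* proclist - faillist */
        const char joins_seen[]     = { 'A', 'B' };     /* C never sends one */
        int i, j, met = 1;

        for (i = 0; i < 2; i++) {
                int found = 0;
                for (j = 0; j < 2; j++)
                        if (joins_seen[j] == consensus_list[i])
                                found = 1;
                if (!found)
                        met = 0;                        /* still waiting for C */
        }
        printf("consensus %s\n",
            met ? "met, form the ring" : "not met: timer expires, faillist grows to B,C again");
        return 0;
}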

If my assumption and analysis are right, then I think step 8 is where the wrong thing happens, because the paper I found at http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.52.4028&rep=rep1&type=pdf says: “if a processor receives a join message in the operational state and if the receiver’s identifier is in the join message’s fail list, … then it ignores the join message.”

Figure 4.4 doesn't match the text.  I've found that in these cases in academic papers, the text takes precedence.

So I created a patch that applies the above rule to try to solve the problem:

--- ./corosync-1.4.6-orig/exec/totemsrp.c Wed May 29 14:33:27 2013 UTC
+++ ./corosync-1.4.6/exec/totemsrp.c Wed Nov 6 13:12:30 2013 UTC
@@ -4274,6 +4274,36 @@
         srp_addr_copy_endian_convert (&out->system_from, &in->system_from);
 }

+static int ignore_join_under_operational (
+        struct totemsrp_instance *instance,
+        const struct memb_join *memb_join)
+{
+        struct srp_addr *proc_list;
+        struct srp_addr *failed_list;
+        unsigned long long ring_seq;
+
+        proc_list = (struct srp_addr *)memb_join->end_of_memb_join;
+        failed_list = proc_list + memb_join->proc_list_entries;
+        ring_seq = memb_join->ring_seq;
+
+        if (memb_set_subset (&instance->my_id, 1,
+                failed_list, memb_join->failed_list_entries)) {
+                return 1;
+        }
+
+        /* In operational state, my_proc_list is exactly the same as
+           my_memb_list. */
+

what is the point of the below code?

+        if ((memb_set_subset (&memb_join->system_from, 1,
+                instance->my_memb_list,
+                instance->my_memb_entries)) &&
+            (ring_seq < instance->my_ring_id.seq)) {
+                return 1;
+        }
+
+        return 0;
+}
+
 static int message_handler_memb_join (
         struct totemsrp_instance *instance,
         const void *msg,
@@ -4304,7 +4334,9 @@
         }
         switch (instance->memb_state) {
         case MEMB_STATE_OPERATIONAL:
-                memb_join_process (instance, memb_join);

I'd write this condition the other way around:

if (ignore_join_under_operational(instance, memb_join) == 0) {

+                if (0 == ignore_join_under_operational(instance, memb_join)) {
+                        memb_join_process (instance, memb_join);
+                }
                 break;

         case MEMB_STATE_GATHER:
Currently, I haven’t reproduced the problem in a 3-node cluster, but I have reproduced the “a processor receives a join message in the operational state and the receiver’s identifier is in the join message’s fail list” circumstance in a two-node environment, using the following steps:
1. iptables -A INPUT -i eth0 -p udp ! --sport domain -j DROP
2. usleep 2126000
3. iptables -D INPUT -i eth0 -p udp ! --sport domain -j DROP

In the two-node environment there is no dead-loop issue as in the 3-node one, because there is no consensus timeout caused by the third, dead node in step 9. But it can still be used to prove the patch.

Please take a look at this issue. Thanks!


Please use git send-email to send the patch.  It allows easier merging of the patch and attribution of the work.

Regards
-steve


--
Yours,
Jason


_______________________________________________
discuss mailing list
discuss@xxxxxxxxxxxx
http://lists.corosync.org/mailman/listinfo/discuss
