Issue: two nodes could not be merged into one ring

Hi All,
I recently ran into a problem where two nodes could not be merged into one ring.
Initially there were three nodes in a ring, say A, B and C. After killing C, I found that A and B could never be merged again (I waited at least 4 hours) unless I restarted at least one of them.
By analyzing the blackbox log, I found that both A and B were stuck in an endless loop doing the following:
1. Form a single-node ring.
2. The ring is broken by a JOIN message from the peer.
3. Try to form a two-node ring, but the consensus times out.
4. Go back to 1.

I checked the network with omping and it was OK. I used the default corosync.conf.example, and the corosync version is 1.4.6.

To analyze more deeply, I captured the traffic with tcpdump to see the content of the messages exchanged between the two nodes, and found the following strange things:
1. Every 50ms (I think this is the join timeout):
    Node A sends a join message with proclist:A,B,C and faillist:B.
    Node B sends a join message with proclist:A,B,C and faillist:A.

2. Every 1250ms (the consensus timeout):
    Node A sends a join message with proclist:A,B,C and faillist:B,C.
    Node B sends a join message with proclist:A,B,C and faillist:A,C.

This should be because A and B each treat the other as failed, so a two-node ring can never form and each single-node ring is immediately broken again by the peer's join messages.
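
For context, the proclist and faillist in the dumps above travel inside the join message itself, and the patch further down simply walks these two trailing lists. Here is a rough sketch of that layout; it is my own simplified stand-in, NOT the real definition from exec/totemsrp.c (the totem message header is omitted and struct srp_addr is reduced to a plain node id), and it only illustrates why "faillist:B" means B's address sits in the trailing failed-list area:

/* Simplified sketch of a totem join message, modeled on the fields the
 * patch below reads.  Not the real corosync definition. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

struct srp_addr {
	unsigned int nodeid;              /* stand-in for the real address  */
};

struct memb_join {
	struct srp_addr system_from;      /* sender of the join             */
	unsigned int proc_list_entries;   /* number of proclist entries     */
	unsigned int failed_list_entries; /* number of faillist entries     */
	unsigned long long ring_seq;      /* sender's current ring sequence */
	unsigned char end_of_memb_join[]; /* proclist first, then faillist  */
};

int main (void)
{
	/* Build node A's observed 1250ms join: proclist A,B,C, faillist B,C. */
	struct srp_addr proc[3] = { {1}, {2}, {3} };
	struct srp_addr fail[2] = { {2}, {3} };
	struct memb_join *j = malloc (sizeof (*j) + sizeof (proc) + sizeof (fail));
	struct srp_addr *proc_list;
	struct srp_addr *failed_list;
	unsigned int i;

	j->system_from.nodeid = 1;
	j->proc_list_entries = 3;
	j->failed_list_entries = 2;
	j->ring_seq = 42;                 /* arbitrary value for the sketch */
	memcpy (j->end_of_memb_join, proc, sizeof (proc));
	memcpy (j->end_of_memb_join + sizeof (proc), fail, sizeof (fail));

	/* Recover the two trailing lists exactly as the patch below does. */
	proc_list = (struct srp_addr *)j->end_of_memb_join;
	failed_list = proc_list + j->proc_list_entries;

	for (i = 0; i < j->failed_list_entries; i++) {
		printf ("faillist entry: node %u\n", failed_list[i].nodeid);
	}
	free (j);
	return 0;
}

The real message also carries a totem header and full address structures, but as far as this analysis is concerned the two-list tail is the part that matters.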

I am not sure why both A and B mark each other as failed in their join messages. From analyzing the code, the most likely cause is a network partition, so I made the following assumption about what happened:

1. Initially, ring(A,B,C).
2. The network between A and B is partitioned and, at the same time, C goes down.
3. Node A sends a join message with proclist:A,B,C and faillist:NULL. Node B sends a join message with proclist:A,B,C and faillist:NULL.
4. Both A and B hit the consensus timeout because of the network partition.
5. The network between A and B is restored.
6. Node A sends a join message with proclist:A,B,C and faillist:B,C and creates ring(A). Node B sends a join message with proclist:A,B,C and faillist:A,C and creates ring(B).
7. The join message with proclist:A,B,C and faillist:A,C sent by node B in step 6 is now received by node A, because the network has remerged.
8. Node A shifts to the gather state and sends out a modified join message with proclist:A,B,C and faillist:B. Such a join message will prevent A and B from ever merging.
9. Node A hits the consensus timeout (caused by waiting for node C) and sends a join message with proclist:A,B,C and faillist:B,C again.

The same thing happens on node B, so A and B loop forever through steps 7, 8 and 9.
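
To make that loop concrete, here is a small toy model, plain C and NOT corosync code: the fail-list handling and the consensus rule are deliberately oversimplified, and it only replays the observed 50ms/1250ms join pattern to show why the two proposals never match while C stays dead:

/* Toy model of the loop in steps 7-9.  Nodes, timers and fail-list
 * handling are stripped down to the minimum needed for illustration. */
#include <stdio.h>

#define NODE_A 0x1
#define NODE_B 0x2
#define NODE_C 0x4
#define PROCLIST (NODE_A | NODE_B | NODE_C)

/* Membership a node proposes in its join: proclist minus faillist. */
static unsigned int proposed (unsigned int faillist)
{
	return PROCLIST & ~faillist;
}

int main (void)
{
	unsigned int fail_a = NODE_B | NODE_C;  /* A's fail list after step 6 */
	unsigned int fail_b = NODE_A | NODE_C;  /* B's fail list after step 6 */
	int round;

	for (round = 1; round <= 3; round++) {
		/* Steps 7/8: each node receives the peer's join (which lists the
		 * receiver as failed), processes it anyway, re-enters gather and
		 * keeps the peer in its fail list for the 50ms joins. */
		fail_a = NODE_B;
		fail_b = NODE_A;
		printf ("round %d, gather:  A proposes 0x%x, B proposes 0x%x -> %s\n",
			round, proposed (fail_a), proposed (fail_b),
			proposed (fail_a) == proposed (fail_b) ? "match" : "no consensus");

		/* Step 9: consensus times out waiting for dead C, so C is added
		 * back to both fail lists and each node forms a single-node ring,
		 * which the peer's next join breaks again. */
		fail_a |= NODE_C;
		fail_b |= NODE_C;
		printf ("round %d, timeout: A proposes 0x%x, B proposes 0x%x -> %s\n",
			round, proposed (fail_a), proposed (fail_b),
			proposed (fail_a) == proposed (fail_b) ? "match" : "no consensus");
	}
	return 0;
}

Both phases always end in "no consensus": A proposes {A,C} or {A}, B proposes {B,C} or {B}, dead C never answers, and nothing changes from one round to the next.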

If my assumption and analysis are right, then I think step 8 is where the wrong thing happens, because according to the paper I found at http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.52.4028&rep=rep1&type=pdf , it says: "if a processor receives a join message in the operational state and if the receiver's identifier is in the join message's fail list, … then it ignores the join message."

So I created a patch that applies the above rule, to try to solve the problem:

--- ./corosync-1.4.6-orig/exec/totemsrp.c Wed May 29 14:33:27 2013 UTC
+++ ./corosync-1.4.6/exec/totemsrp.c Wed Nov 6 13:12:30 2013 UTC
@@ -4274,6 +4274,36 @@
 	srp_addr_copy_endian_convert (&out->system_from, &in->system_from);
 }
 
+static int ignore_join_under_operational (
+	struct totemsrp_instance *instance,
+	const struct memb_join *memb_join)
+{
+	struct srp_addr *proc_list;
+	struct srp_addr *failed_list;
+	unsigned long long ring_seq;
+
+	proc_list = (struct srp_addr *)memb_join->end_of_memb_join;
+	failed_list = proc_list + memb_join->proc_list_entries;
+	ring_seq = memb_join->ring_seq;
+
+	if (memb_set_subset (&instance->my_id, 1,
+	    failed_list, memb_join->failed_list_entries)) {
+		return 1;
+	}
+
+	/* In operational state, my_proc_list is exactly the same as
+	   my_memb_list. */
+
+	if ((memb_set_subset (&memb_join->system_from, 1,
+	    instance->my_memb_list,
+	    instance->my_memb_entries)) &&
+	    (ring_seq < instance->my_ring_id.seq)) {
+		return 1;
+	}
+
+	return 0;
+}
+
 static int message_handler_memb_join (
 	struct totemsrp_instance *instance,
 	const void *msg,
@@ -4304,7 +4334,9 @@
 	}
 	switch (instance->memb_state) {
 	case MEMB_STATE_OPERATIONAL:
-		memb_join_process (instance, memb_join);
+		if (0 == ignore_join_under_operational (instance, memb_join)) {
+			memb_join_process (instance, memb_join);
+		}
 		break;
 
 	case MEMB_STATE_GATHER:
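
For what it is worth, the first check of ignore_join_under_operational can be exercised on its own. The sketch below is NOT corosync code: srp_addr and memb_set_subset are reduced to trivial stand-ins here, and only the decision logic mirrors the patch. It feeds in the step-7 case, i.e. node A (operational) receiving B's join whose faillist contains A and C:

/* Standalone sketch of the rule "ignore a join received in OPERATIONAL
 * state if my own id is in its fail list".  Types are simplified. */
#include <stdio.h>

struct srp_addr {
	unsigned int nodeid;
};

/* Simplified memb_set_subset: is every entry of "subset" present in "set"? */
static int memb_set_subset (const struct srp_addr *subset, int subset_entries,
	const struct srp_addr *set, int set_entries)
{
	int i, j, found;

	for (i = 0; i < subset_entries; i++) {
		found = 0;
		for (j = 0; j < set_entries; j++) {
			if (subset[i].nodeid == set[j].nodeid) {
				found = 1;
			}
		}
		if (found == 0) {
			return 0;
		}
	}
	return 1;
}

/* First half of ignore_join_under_operational from the patch above. */
static int my_id_in_fail_list (const struct srp_addr *my_id,
	const struct srp_addr *failed_list, int failed_list_entries)
{
	return memb_set_subset (my_id, 1, failed_list, failed_list_entries);
}

int main (void)
{
	struct srp_addr my_id = { 1 };                 /* node A          */
	struct srp_addr fail_from_b[] = { {1}, {3} };  /* faillist: A, C  */

	if (my_id_in_fail_list (&my_id, fail_from_b, 2)) {
		printf ("join ignored: receiver is in the sender's fail list\n");
	} else {
		printf ("join processed\n");
	}
	return 0;
}

That is exactly the join that used to break A's single-node ring in step 7; with the patch it is ignored, so A (and, symmetrically, B) stays operational instead of bouncing back into the gather state.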

Currently I haven't reproduced the problem in a 3-node cluster, but I have reproduced the "a processor receives a join message in the operational state and the receiver's identifier is in the join message's fail list" circumstance in a two-node environment, by using the following steps:
1. iptables -A INPUT -i eth0 -p udp ! --sport domain -j DROP
2. usleep 2126000
3. iptables -D INPUT -i eth0 -p udp ! --sport domain -j DROP

In the two-node environment there is no dead-loop issue as in the 3-node one, because there is no consensus timeout caused by the third, dead node in step 9. But it can still be used to verify the patch.

Please take a look at this issue. Thanks!



--
Yours,
Jason
