Jason,
Nice dig into the code/totem. Hope you didn't break the bank on
red bull :) I have a few comments inline:
On 11/06/2013 07:16 AM, jason wrote:
Hi All,
I recently encountered a problem where two nodes could not
merge into one ring.
Initially, there were three nodes in a ring, say A, B and
C. After killing C, I found that A and B never merged again
(I waited at least 4 hours) unless I restarted at least one
of them.
By analyzing the blackbox log, I found that both A and B
were stuck in a loop doing the following:
1. Form a single-node ring.
2. The ring is broken by a JOIN message from the peer.
3. Try to form a two-node ring, but the consensus timeout expires.
4. Go to 1.
I checked the network with omping and it was OK.
I used the default corosync.conf.example, and the corosync
version is 1.4.6.
To analyze more deeply, I tcpdumped the traffic to see the
content of the messages exchanged between the two nodes, and found
the following strange things:
1. Every 50ms (I think this is the join timeout):
   Node A sends a join message with proclist: A,B,C; faillist: B.
   Node B sends a join message with proclist: A,B,C; faillist: A.
2. Every 1250ms (the consensus timeout):
   Node A sends a join message with proclist: A,B,C; faillist: B,C.
   Node B sends a join message with proclist: A,B,C; faillist: A,C.
Something is missing from your tcpdump analysis. Once the consensus
timer expires, consensus will be met:
Node A will calculate consensus based upon proclist - faillist = A; A has
received all join messages in its consensus list, hence consensus is met.
Node B will calculate consensus based upon proclist - faillist = B; B has
received all join messages in its consensus list, hence consensus is met.
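To make the arithmetic concrete, here is a minimal standalone sketch (plain
ints for node ids, not corosync code) of the proclist - faillist calculation
for node A:

#include <stdio.h>

/* Return 1 if id is present in list[0..entries-1]. */
static int in_list (int id, const int *list, int entries)
{
    int i;

    for (i = 0; i < entries; i++) {
        if (list[i] == id) {
            return 1;
        }
    }
    return 0;
}

int main (void)
{
    /* Node A's view after the consensus timeout:
       proclist A,B,C; faillist B,C. */
    int proclist[] = { 'A', 'B', 'C' };
    int faillist[] = { 'B', 'C' };
    int i;

    /* Consensus is calculated over proclist - faillist. */
    printf ("consensus list:");
    for (i = 0; i < 3; i++) {
        if (in_list (proclist[i], faillist, 2) == 0) {
            printf (" %c", proclist[i]);
        }
    }
    printf ("\n");    /* prints: consensus list: A */
    return 0;
}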
What I would expect from step 3 is, after 1250ms:
Node A will send a join message with proclist: A,B,C; faillist: B,C
Node B will send a join message with proclist: A,B,C; faillist: A,C
Further join messages will contain these sets. This should lead to:
Node A forming a singleton configuration because consensus is agreed
Node B forming a singleton configuration because consensus is agreed
Node A sends merge detect
Node A enters gather and sends join with proclist: A, faillist: empty
Node B sends merge detect
Node B enters gather and sends join with proclist: B, faillist: empty
Node A, B receive proclists from A, B, both enter consensus and form
a new ring A, B
You said C was killed. This leads to the natural question of why it
is still in the proc list after each node forms a singleton.
It should be because both A and B treated each other as
failed, so that a two-node ring could never be formed and the
single-node ring was always broken by the peer's join messages.
I am not sure why both A and B set each other as failed in
their join messages. From analyzing the code, the most likely
cause is a network partition, so I made the following assumption
about what happened:
1. Initially, ring(A,B,C).
2. A and B hit a network partition and, "at the same time", C goes
down.
3. Node A sends a join message with proclist: A,B,C;
faillist: NULL. Node B sends a join message with proclist: A,B,C;
faillist: NULL.
4. Both A and B hit the consensus timeout due to the network partition.
5. The network between A and B re-merges.
6. Node A sends a join message with proclist: A,B,C;
faillist: B,C and creates ring(A). Node B sends a join message
with proclist: A,B,C; faillist: A,C and creates ring(B).
7. Say the join message with proclist: A,B,C; faillist: A,C
sent by node B is received by node A, because the network has re-merged.
8. Node A shifts to the gather state and sends out a modified
join message with proclist: A,B,C; faillist: B. Such a join
message prevents both A and B from merging.
9. Node A hits the consensus timeout (caused by waiting for node C) and
sends a join message with proclist: A,B,C; faillist: B,C again.
good analysis
The same thing happens on node B, so A and B dead-loop
forever through steps 7, 8 and 9.
Figure 4.4 doesn't match the text. I've found that in these cases in
academic papers, the text takes precedence.
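To make steps 7-9 concrete, here is a toy trace (plain C, not corosync code)
of why the two faillists never let the rings merge: a join that names the
receiver as failed only pushes the receiver back into gather, with the sender
still on the receiver's own faillist.

#include <stdio.h>
#include <string.h>

struct node {
    char name;
    char faillist[8];
};

static void deliver_join (struct node *to, const struct node *from)
{
    if (strchr (from->faillist, to->name) != NULL) {
        /* Steps 7/8: we are in the sender's faillist, so we re-gather
           and keep distrusting the sender. */
        printf ("%c: join from %c lists me as failed -> re-gather, "
            "my faillist stays \"%s\"\n",
            to->name, from->name, to->faillist);
    } else {
        printf ("%c: join from %c accepted, rings can merge\n",
            to->name, from->name);
    }
}

int main (void)
{
    struct node a = { 'A', "BC" };
    struct node b = { 'B', "AC" };
    int round;

    /* Step 9: the consensus timeout fires and the same join messages
       are exchanged again, so the cycle repeats indefinitely. */
    for (round = 0; round < 2; round++) {
        deliver_join (&a, &b);
        deliver_join (&b, &a);
    }
    return 0;
}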
So I created a patch applying the above reasoning to try to
solve the problem:
--- ./corosync-1.4.6-orig/exec/totemsrp.c	Wed May 29 14:33:27 2013 UTC
+++ ./corosync-1.4.6/exec/totemsrp.c	Wed Nov 6 13:12:30 2013 UTC
@@ -4274,6 +4274,36 @@
 	srp_addr_copy_endian_convert (&out->system_from, &in->system_from);
 }
 
+static int ignore_join_under_operational (
+	struct totemsrp_instance *instance,
+	const struct memb_join *memb_join)
+{
+	struct srp_addr *proc_list;
+	struct srp_addr *failed_list;
+	unsigned long long ring_seq;
+
+	proc_list = (struct srp_addr *)memb_join->end_of_memb_join;
+	failed_list = proc_list + memb_join->proc_list_entries;
+	ring_seq = memb_join->ring_seq;
+
+	if (memb_set_subset (&instance->my_id, 1,
+	    failed_list, memb_join->failed_list_entries)) {
+		return 1;
+	}
+
+	/* In operational state, my_proc_list is exactly the same as
+	   my_memb_list. */

what is the point of the below code?

+	if ((memb_set_subset (&memb_join->system_from, 1,
+	    instance->my_memb_list, instance->my_memb_entries)) &&
+	    (ring_seq < instance->my_ring_id.seq)) {
+		return 1;
+	}
+
+	return 0;
+}
+
 static int message_handler_memb_join (
 	struct totemsrp_instance *instance,
 	const void *msg,
@@ -4304,7 +4334,9 @@
 	}
 	switch (instance->memb_state) {
 	case MEMB_STATE_OPERATIONAL:
-		memb_join_process (instance, memb_join);
+		if (ignore_join_under_operational (instance, memb_join) == 0) {
+			memb_join_process (instance, memb_join);
+		}
 		break;
 
 	case MEMB_STATE_GATHER:
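The first check above is clear enough; for anyone following along, here is a
minimal standalone sketch of it (plain ints for node ids and a generic subset
helper, not the real memb_set_subset or srp_addr types): a join whose faillist
contains the receiver is ignored instead of pulling the receiver out of the
operational state.

#include <stdio.h>

/* Return 1 if every id in subset also appears in fullset
   (a generic stand-in for a memb_set_subset-style check). */
static int set_subset (const int *subset, int subset_entries,
    const int *fullset, int fullset_entries)
{
    int i, j, found;

    for (i = 0; i < subset_entries; i++) {
        found = 0;
        for (j = 0; j < fullset_entries; j++) {
            if (subset[i] == fullset[j]) {
                found = 1;
                break;
            }
        }
        if (found == 0) {
            return 0;
        }
    }
    return 1;
}

int main (void)
{
    int my_id = 'A';
    /* The faillist carried by node B's join message in the scenario above. */
    int failed_list[] = { 'A', 'C' };

    if (set_subset (&my_id, 1, failed_list, 2)) {
        printf ("join ignored: receiver is in the sender's faillist\n");
    } else {
        printf ("join processed\n");
    }
    return 0;
}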
Currently, I haven’t reproduced the problem in a 3-node
cluster, but I have reproduced the “a processor receives a
join message in the operational state and the receiver’s
identifier is in the join message’s fail list” circumstance in
a two-node environment, using the following steps:
1. iptables -A INPUT -i eth0 -p udp ! --sport domain -j DROP
2. usleep 2126000
3. iptables -D INPUT -i eth0 -p udp ! --sport domain -j DROP
In the two-node environment there is no dead-loop issue as
in the 3-node one, because there is no consensus timeout
caused by the third (dead) node in step 9. But it can still be
used to prove the patch.
Please take a look at this issue. Thanks!
Please use git send-email to send the email. It allows easier
merging of the patch and attribution of the work.
Regards
-steve
_______________________________________________
discuss mailing list
discuss@xxxxxxxxxxxx
http://lists.corosync.org/mailman/listinfo/discuss