Hi Steven,
Jason,
Nice dig into the code/totem. Hope you didn't break the bank on
Red Bull :) I have a few comments inline:
Well, at least better than the guy from Crystal Lake ;).
On 11/06/2013 07:16 AM, jason wrote:
Hi All,
I recently encountered a problem where two nodes could not
be merged into one ring.
Initially there were three nodes in a ring, say A, B and C.
After killing C, I found that A and B could never merge again
(I waited at least 4 hours) unless I restarted at least one of
them.
By analyzing the blackbox log, I found that both A and B were
stuck in an endless loop doing the following:
1. Form a single-node ring.
2. The ring is broken by a JOIN message from the peer.
3. Try to form a two-node ring, but the consensus times out.
4. Go to 1.
I checked the network using omping and it was OK. I used the
default corosync.conf.example, and the corosync version is
1.4.6.
To dig deeper, I used tcpdump to capture the messages
exchanged between the two nodes, and found the following
strange things:
1. Every 50ms (I think this is the join timeout):
   Node A sends a join message with proclist: A,B,C; faillist: B.
   Node B sends a join message with proclist: A,B,C; faillist: A.
2. Every 1250ms (the consensus timeout):
   Node A sends a join message with proclist: A,B,C; faillist: B,C.
   Node B sends a join message with proclist: A,B,C; faillist: A,C.
Something is missing from your tcpdump analysis. Once the consensus
times out, consensus will be met:
Node A will calculate consensus based upon proclist - faillist = A; A has
received all join messages in its consensus list, hence consensus is met.
Node B will calculate consensus based upon proclist - faillist = B; B has
received all join messages in its consensus list, hence consensus is met.
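To make that set arithmetic concrete, here is a tiny standalone sketch
(bit sets only, not the real totemsrp code; the NODE_* constants and
candidate_ring are made up for illustration) showing how each node's
candidate ring reduces to just itself once the other live node is on the
fail list:

#include <stdio.h>

enum { NODE_A = 1 << 0, NODE_B = 1 << 1, NODE_C = 1 << 2 };

/* The candidate ring is the proc list minus the failed list; consensus is
 * reached once a join has been seen from every node in that set. */
static unsigned candidate_ring(unsigned proc_list, unsigned failed_list)
{
        return proc_list & ~failed_list;
}

int main(void)
{
        /* Node A after the consensus timeout: proclist A,B,C; faillist B,C. */
        unsigned a = candidate_ring(NODE_A | NODE_B | NODE_C, NODE_B | NODE_C);
        /* Node B after the consensus timeout: proclist A,B,C; faillist A,C. */
        unsigned b = candidate_ring(NODE_A | NODE_B | NODE_C, NODE_A | NODE_C);

        /* Each set contains only the node itself, so each node needs only
         * its own join message to reach consensus and should form a
         * singleton ring. */
        printf("A's candidate ring: %#x, B's candidate ring: %#x\n", a, b);
        return 0;
}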
What I would expect from step 3 is, after 1250ms:
Node A will send a join message with proclist: A, B, C; faillist: B, C.
Node B will send a join message with proclist: A, B, C; faillist: A, C.
Further join messages will contain these sets. This should lead to:
Node A forming a singleton configuration because consensus is agreed.
Node B forming a singleton configuration because consensus is agreed.
Node A sends merge detect.
Node A enters gather and sends a join with proclist: A, faillist: empty.
Node B sends merge detect.
Node B enters gather and sends a join with proclist: B, faillist: empty.
In the tcpdump capture, I could not find either the merge detect message or a join message like the ones above. Maybe the singleton configuration had no chance to send them out before it was broken by the peer's join messages, which list the receiver in their fail list.
Nodes A and B receive the proclists from A and B, both reach consensus,
and form a new ring A, B.
You said C was killed. This leads to the natural question of why it
is still in the proc list after each node forms a singleton.
In the tcpdump capture, I also could not find any join message whose proclist omits node C. Per my assumption below, this may be because the proclist is always carried over from before the time C was killed.
It should be because A and B each treated the other as failed,
so a two-node ring could never form and the single-node ring
was always being broken by join messages.
I am not sure why A and B originally marked each other as
failed in their join messages. From analyzing the code, the
most likely cause is a network partition, so I made the
following assumption about what happened:
1. Initially, ring(A,B,C).
2. A and B suffer a network partition and, at the same time, C
goes down.
3. Node A sends a join message with proclist: A,B,C; faillist:
NULL. Node B sends a join message with proclist: A,B,C;
faillist: NULL.
4. Both A and B hit the consensus timeout due to the network
partition.
5. The network between A and B remerges.
6. Node A sends a join message with proclist: A,B,C; faillist:
B,C and creates ring(A). Node B sends a join message with
proclist: A,B,C; faillist: A,C and creates ring(B).
7. Say the join message with proclist: A,B,C; faillist: A,C
sent by node B is received by node A, because the network has
remerged.
8. Node A shifts to the gather state and sends out a modified
join message with proclist: A,B,C; faillist: B. Such join
messages prevent A and B from merging.
9. Node A hits the consensus timeout (caused by waiting for
node C) and sends a join message with proclist: A,B,C;
faillist: B,C again.
good analysis
The same thing happens on node B, so A and B will loop forever
through steps 7, 8 and 9 (see the toy sketch below).
Figure 4.4 doesn't match the text. I've found that in these cases in
academic papers, the text takes precedence.
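To illustrate why steps 7-9 never converge, here is a toy model (again
not the corosync code; on_join and the struct are made up, it only mirrors
the message contents seen in the tcpdump) in which each node keeps blaming
a peer that lists it as failed, so neither consensus set ever grows past
the node itself:

#include <stdio.h>

enum { NODE_A = 1 << 0, NODE_B = 1 << 1, NODE_C = 1 << 2 };

struct node {
        const char *name;
        unsigned id;
        unsigned proc;  /* my_proc_list as a bit set */
        unsigned fail;  /* my_failed_list as a bit set */
};

/* A join message here is just the sender's current proc/fail sets. */
static void on_join(struct node *self, const struct node *from)
{
        self->proc |= from->proc;
        if (from->fail & self->id) {
                /* The sender blames us, so we fall back to gather and end
                 * up blaming the sender in turn (step 8 of the scenario). */
                self->fail |= from->id;
        }
}

int main(void)
{
        /* State right after the partition heals (step 6). */
        struct node a = { "A", NODE_A, NODE_A | NODE_B | NODE_C, NODE_B | NODE_C };
        struct node b = { "B", NODE_B, NODE_A | NODE_B | NODE_C, NODE_A | NODE_C };

        for (int round = 0; round < 3; round++) {
                on_join(&a, &b);
                on_join(&b, &a);
                /* proc minus fail never grows past the node itself, so
                 * neither node can ever agree on a two-node ring. */
                printf("round %d: A can only agree on %#x, B on %#x\n",
                       round, a.proc & ~a.fail, b.proc & ~b.fail);
        }
        return 0;
}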
So I created a patch to apply the above algorithm and try to
solve the problem:
--- ./corosync-1.4.6-orig/exec/totemsrp.c	Wed May 29 14:33:27 2013 UTC
+++ ./corosync-1.4.6/exec/totemsrp.c	Wed Nov  6 13:12:30 2013 UTC
@@ -4274,6 +4274,36 @@
 	srp_addr_copy_endian_convert (&out->system_from, &in->system_from);
 }
 
+static int ignore_join_under_operational (
+	struct totemsrp_instance *instance,
+	const struct memb_join *memb_join)
+{
+	struct srp_addr *proc_list;
+	struct srp_addr *failed_list;
+	unsigned long long ring_seq;
+
+	proc_list = (struct srp_addr *)memb_join->end_of_memb_join;
+	failed_list = proc_list + memb_join->proc_list_entries;
+	ring_seq = memb_join->ring_seq;
+
+	if (memb_set_subset (&instance->my_id, 1,
+		failed_list, memb_join->failed_list_entries)) {
+		return 1;
+	}
+
+	/* In operational state, my_proc_list is exactly the same as
+	   my_memb_list. */
What is the point of the code below?
It is also from the text of the paper; I just brought it all together. As the paper also says: if a processor receives a join message in the operational state, and the sender's identifier is in the receiver's my_proc_list, and the join message's ring_seq is less than the receiver's ring sequence number, then it ignores the join message too.
+	if ((memb_set_subset (&memb_join->system_from, 1,
+		instance->my_memb_list,
+		instance->my_memb_entries)) &&
+		(ring_seq < instance->my_ring_id.seq)) {
+		return 1;
+	}
+
+	return 0;
+}
+
 static int message_handler_memb_join (
 	struct totemsrp_instance *instance,
 	const void *msg,
@@ -4304,7 +4334,9 @@
 	}
 	switch (instance->memb_state) {
 	case MEMB_STATE_OPERATIONAL:
-		memb_join_process (instance, memb_join);
+		if (ignore_join_under_operational (instance, memb_join) == 0) {
+			memb_join_process (instance, memb_join);
+		}
 		break;
 
 	case MEMB_STATE_GATHER:
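For clarity, here is a simplified standalone version of the check the
patch adds, using made-up toy types (join_msg, member_state) and bit sets
instead of the real corosync structures, just to show the two conditions
under which an operational node would now ignore a join:

#include <stdio.h>

enum { NODE_A = 1 << 0, NODE_B = 1 << 1 };

struct join_msg {
        unsigned sender;                /* bit of the sending node */
        unsigned failed_list;           /* nodes the sender considers failed */
        unsigned long long ring_seq;    /* ring the sender refers to */
};

struct member_state {
        unsigned my_id;
        unsigned my_memb_list;          /* members of the current ring */
        unsigned long long my_ring_seq; /* current ring sequence number */
};

static int ignore_join_under_operational(const struct member_state *m,
                                         const struct join_msg *j)
{
        /* Condition 1: the sender lists us as failed. */
        if (j->failed_list & m->my_id)
                return 1;
        /* Condition 2: the sender is already in our membership and its
         * join refers to an older ring, i.e. the message is stale. */
        if ((m->my_memb_list & j->sender) && (j->ring_seq < m->my_ring_seq))
                return 1;
        return 0;
}

int main(void)
{
        struct member_state a = { NODE_A, NODE_A, 12 }; /* singleton ring(A) */
        struct join_msg from_b = { NODE_B, NODE_A, 8 }; /* B still blames A */

        /* Prints 1: with the patch, A drops this join instead of letting it
         * break the singleton ring it just formed. */
        printf("ignore join from B: %d\n",
               ignore_join_under_operational(&a, &from_b));
        return 0;
}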
So far I have not reproduced the problem in a 3-node cluster,
but I have reproduced the "a processor receives a join message
in the operational state and the receiver's identifier is in
the join message's fail list" circumstance in a two-node
environment, using the following steps:
1. iptables -A INPUT -i eth0 -p udp ! --sport domain -j DROP
2. usleep 2126000
3. iptables -D INPUT -i eth0 -p udp ! --sport domain -j DROP
In the two-node environment there is no dead-loop issue like
in the 3-node case, because there is no consensus timeout
caused by the third (dead) node in step 9. But it can still be
used to prove that the patch works.
Please take a look at this issue. Thanks!
Please use git send-email to send the email. It allows easier
merging of the patch and attribution of the work.
Thanks, I will resend this patch as soon as I get familiar with git.