Re: issue about two nodes could not be merged into one ring

Hi Steven,

On November 7, 2013, at 0:36, Steven Dake <sdake@xxxxxxxxxx> wrote:

Jason,

Nice dig into the code/totem.  Hope you didn't break the bank on Red Bull :)  I have a few comments inline:
Well, at least better than the guy from Crystal Lake ;).

On 11/06/2013 07:16 AM, jason wrote:
Hi All,
I recently encountered a problem where two nodes could not be merged into one ring.
Initially there were three nodes in a ring, say A, B and C. Then, after killing C, I found that A and B could never be merged (I waited at least 4 hours) unless I restarted at least one of them.
By analyzing the blackbox log, I found that both A and B were stuck in a loop doing the following:
1. Form a single node ring.
2. The ring is broken by a JOIN message from peer.
3. Try to form a two-node ring but consensus timeout.
4. Go to 1.

I checked the network with omping and it was OK. Also, I used the default corosync.conf.example, and the corosync version is 1.4.6.

To analyze more deeply, I used tcpdump to capture the messages exchanged between the two nodes, and found the following strange things:
1. Every 50ms (I think it is the join timeout):
    Node A sends a join message with proclist:A,B,C. faillist:B.
    Node B sends a join message with proclist:A,B,C. faillist:A.

2. Every 1250ms (consensus timeout):
    Node A sends a join message with proclist:A,B,C. faillist:B,C.
    Node B sends a join message with proclist:A,B,C. faillist:A,C.


Something is missing from your tcpdump analysis.  Once the consensus times out, consensus will be met:

Node A will calculate consensus based upon proclist - faillist = {A}; A has received join messages from everyone in its consensus list, hence consensus is met.

Node B will calculate consensus based upon proclist - faillist = {B}; B has received join messages from everyone in its consensus list, hence consensus is met.

What I would expect from step 3 is
after 1250ms:
node A will send join message with proclist: A, B, C.  faillist: B,C
Node B will send join message with proclist A, B, C.  faillist: A, C.

Further join messages will contain these sets.  This should lead to

Node A forming a singleton configuration because consensus is agreed
Node B forming a singleton configuration because consensus is agreed

Node A sends merge detect
Node A enters gather and sends join with proclist: A, faillist: empty

Node B sends merge detect
Node B enters gather and sends join with proclist: B, faillist: empty
In the tcpdump result, I could find neither the merge detect message nor a join message like the above. Maybe the singleton configuration has no chance to send them out before it is broken by join messages from the peer that name it in their fail list.

Node A, B receive proclist from A, B, both enter consensus and form a new ring A, B

You said C was killed.  This leads to the natural question of why it is still in the proc list after each node forms a singleton.

In the tcpdump result, I also could not find any join message whose proclist omits node C. Per my assumption below, that may be because the proclist is always copied forward from the state at the time C was killed.


It seems to be because A and B each treated the other as failed, so a two-node ring could never be formed, and the single-node ring is always broken by join messages.

I am not sure why both A and B set each other as failed in their join messages. From analyzing the code, the most likely cause is a network partition. So I made the following assumption about what happened:

1. Initially, ring(A,B,C).
2. A and B suffered a network partition and, at the same time, C went down.
3. Node A sends join message with proclist:A,B,C. faillist:NULL. Node B sends join message with proclist:A,B,C. faillist:NULL.
4. Both A and B consensus timeout due to network partition.
5. A and B network remerged.
6. Node A sends join message with proclist:A,B,C. faillist:B,C. and create ring(A). Node B sends join message with proclist:A,B,C. faillist:A,C. and create ring(B).
7. Say the join message with proclist:A,B,C. faillist:A,C sent by node B is received by node A after the network remerged.
8. Node A shifts to gather state and sends out a modified join message with proclist:A,B,C. faillist:B. Such a join message will prevent A and B from merging.
9. Node A consensus timeout (caused by waiting node C) and sends join message with proclist:A,B,C. faillist:B,C again.


good analysis

The same thing happens on node B, so A and B loop forever through steps 7, 8 and 9.

If my assumption and analysis are right, then I think step 8 is where the wrong thing happens. According to the paper I found at http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.52.4028&rep=rep1&type=pdf , "if a processor receives a join message in the operational state and if the receiver’s identifier is in the join message’s fail list, … then it ignores the join message."

Figure 4.4 doesn't match the text.  I've found in these cases in academic papers, the text takes precedence.

So I created a patch applying the above rule to try to solve the problem:

--- ./corosync-1.4.6-orig/exec/totemsrp.c Wed May 29 14:33:27 2013 UTC
+++ ./corosync-1.4.6/exec/totemsrp.c Wed Nov 6 13:12:30 2013 UTC
@@ -4274,6 +4274,36 @@
  srp_addr_copy_endian_convert (&out->system_from, &in->system_from);
 }
 
+static int ignore_join_under_operational (
+ struct totemsrp_instance *instance,
+ const struct memb_join *memb_join)
+{
+ struct srp_addr *proc_list;
+ struct srp_addr *failed_list;
+ unsigned long long ring_seq;
+
+ proc_list = (struct srp_addr *)memb_join->end_of_memb_join;
+ failed_list = proc_list + memb_join->proc_list_entries;
+ ring_seq = memb_join->ring_seq;
+
+ if (memb_set_subset (&instance->my_id, 1,
+ failed_list, memb_join->failed_list_entries)) {
+ return 1;
+ }
+
+ /* In operational state, my_proc_list is exactly the same as 
+   my_memb_list. */
+
what is the point of the below code?
It is also from the text of the paper; I just brought the two rules together. The paper also says: "If a processor receives a join message in the operational state and if the sender's identifier is in the receiver's my_proc_list and the join message's ring_seq is less than the receiver's ring sequence number, then it ignores the join message too."

+ if ((memb_set_subset (&memb_join->system_from, 1,
+ instance->my_memb_list,
+ instance->my_memb_entries)) &&
+ (ring_seq < instance->my_ring_id.seq)) {
+ return 1;
+ }
+
+ return 0;
+}
+
 static int message_handler_memb_join (
  struct totemsrp_instance *instance,
  const void *msg,
@@ -4304,7 +4334,9 @@
  }
  switch (instance->memb_state) {
  case MEMB_STATE_OPERATIONAL:
- memb_join_process (instance, memb_join);

Minor style point: `if (ignore_join_under_operational(instance, memb_join) == 0) {` would be the more conventional ordering for the condition.

+ if (0 == ignore_join_under_operational(instance, memb_join)) {
+ memb_join_process (instance, memb_join);
+ }
  break;
 
  case MEMB_STATE_GATHER:

Currently, I haven’t reproduced the problem in a 3-node cluster, but I have reproduced the “a processor receives a join message in the operational state and the receiver’s identifier is in the join message’s fail list” circumstance in a two-node environment, using the following steps:
1. iptables -A INPUT -i eth0 -p udp ! --sport domain -j DROP
2. usleep 2126000
3. iptables -D INPUT -i eth0 -p udp ! --sport domain -j DROP

In the two-node environment there is no dead-loop issue as in the 3-node one, because there is no consensus timeout caused by the third, dead node in step 9. But it can still be used to prove the patch.

Please take a look at this issue, Thanks!


Please use git send-email to send the email.  It allows an easier merging of the patch and attribution of the work.

Thanks, I will resend this patch as soon as I get familiar with git.
  
Regards
-steve


--
Yours,
Jason


_______________________________________________
discuss mailing list
discuss@xxxxxxxxxxxx
http://lists.corosync.org/mailman/listinfo/discuss

