Re: issue about two nodes could not be merged into one ring

Jason,

Nice dig into the code/totem.  Hope you didn't break the bank on red bull :)  I have a few comments inline:

On 11/06/2013 07:16 AM, jason wrote:
Hi All,
I recently encountered a problem where two nodes could not be merged into one ring.
Initially there were three nodes in a ring, say A, B and C. Then, after killing C, I found that A and B could never merge again (I waited at least 4 hours) unless I restarted at least one of them.
From the blackbox log, both A and B are stuck in an endless loop doing the following (roughly sketched in code just after this list):
1. Form a single node ring.
2. The ring is broken by a JOIN message from peer.
3. Try to form a two-node ring but consensus timeout.
4. Go to 1.
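Roughly, as a toy model only (not corosync code, just the cycle above spelled out in C):

#include <stdio.h>

/* Toy model of the cycle seen in the blackbox log; not corosync code. */
enum state { SINGLE_NODE_RING, RING_BROKEN_BY_JOIN, CONSENSUS_WAIT };

int main(void)
{
        enum state s = SINGLE_NODE_RING;
        int step;

        for (step = 0; step < 6; step++) {      /* in the real cluster this never ends */
                switch (s) {
                case SINGLE_NODE_RING:
                        puts("1. form a single-node ring");
                        s = RING_BROKEN_BY_JOIN;
                        break;
                case RING_BROKEN_BY_JOIN:
                        puts("2. ring broken by a JOIN message from the peer");
                        s = CONSENSUS_WAIT;
                        break;
                case CONSENSUS_WAIT:
                        puts("3. try to form a two-node ring, consensus times out");
                        s = SINGLE_NODE_RING;   /* 4. go back to 1 */
                        break;
                }
        }
        return 0;
}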

I checked the network with omping and it was OK. I used the default corosync.conf.example, and the corosync version is 1.4.6.

To dig deeper, I captured the traffic with tcpdump to see the content of the messages exchanged between the two nodes, and found the following strange things:
1. Every 50ms (I think this is the join timeout):
    Node A sends a join message with proclist:A,B,C. faillist:B.
    Node B sends a join message with proclist:A,B,C. faillist:A.

2. Every 1250ms (the consensus timeout):
    Node A sends a join message with proclist:A,B,C. faillist:B,C.
    Node B sends a join message with proclist:A,B,C. faillist:A,C.


Something is missing from your tcpdump analysis.  Once the consensus timer expires, consensus will be met:

Node A will calculate consensus based upon proclist - faillist = {A}; A has received join messages from everyone in that consensus list, hence consensus is met.

Node B will calculate consensus based upon proclist - faillist = {B}; B has received join messages from everyone in that consensus list, hence consensus is met.
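As a standalone illustration of that set arithmetic (plain ints stand in for the srp_addr entries; this is not the totemsrp code):

#include <stdio.h>

/* Illustration only: consensus is evaluated over proclist minus faillist,
 * so with faillist = {B, C} node A only needs its own join message. */
static int in_set(int id, const int *set, int n)
{
        int i;
        for (i = 0; i < n; i++)
                if (set[i] == id)
                        return 1;
        return 0;
}

int main(void)
{
        int proclist[]   = { 'A', 'B', 'C' };
        int faillist[]   = { 'B', 'C' };        /* node A's view after the timeout */
        int joins_seen[] = { 'A' };             /* join messages node A has received */
        int i, met = 1;

        for (i = 0; i < 3; i++) {
                if (in_set(proclist[i], faillist, 2))
                        continue;               /* failed nodes are not waited for */
                if (!in_set(proclist[i], joins_seen, 1))
                        met = 0;
        }
        printf("consensus %s\n", met ? "met" : "not met");
        return 0;
}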

What I would expect from step 3 is, after 1250ms:
Node A will send a join message with proclist: A, B, C; faillist: B, C.
Node B will send a join message with proclist: A, B, C; faillist: A, C.

Further join messages will contain these sets.  This should lead to

Node A forming a singleton configuration because consensus is agreed
Node B forming a singleton configuration because consensus is agreed

Node A sends merge detect
Node A enters gather and sends join with proclist: A, faillist: empty

Node B sends merge detect
Node B enters gather and sends join with proclist: B, faillist: empty

Nodes A and B receive each other's proclists, both reach consensus and form a new ring containing A and B.
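Putting the expected and the observed join contents side by side (strings instead of srp_addr lists, purely illustrative):

#include <stdio.h>

/* Purely illustrative: real join messages carry srp_addr lists, not strings. */
struct join_msg {
        const char *proclist;
        const char *faillist;
};

int main(void)
{
        /* what I would expect node A to send after forming the singleton
         * and re-entering gather on merge detect: */
        struct join_msg expected = { "A", "(empty)" };

        /* what the tcpdump shows node A sending every 1250ms: */
        struct join_msg observed = { "A,B,C", "B,C" };

        printf("expected join: proclist=%s faillist=%s\n",
            expected.proclist, expected.faillist);
        printf("observed join: proclist=%s faillist=%s\n",
            observed.proclist, observed.faillist);
        return 0;
}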

You said C was killed.  This leads to the natural question of why it is still in the proc list after each node forms a singleton.


It should be because both A and B treat each other as failed, so they can never merge, and each single-node ring is constantly broken by the other's join messages.

I am not sure why both A and B set each other as failed in their join messages. From analyzing the code, the most likely cause is a network partition, so I made the following assumption about what happened:

1. Initially, ring(A,B,C).
2. A network partition separates A and B, and at the same time C goes down.
3. Node A sends a join message with proclist:A,B,C. faillist:NULL. Node B sends a join message with proclist:A,B,C. faillist:NULL.
4. Both A and B hit the consensus timeout because of the partition.
5. The network between A and B heals.
6. Node A sends a join message with proclist:A,B,C. faillist:B,C. and creates ring(A). Node B sends a join message with proclist:A,B,C. faillist:A,C. and creates ring(B).
7. Say the join message with proclist:A,B,C. faillist:A,C sent by node B is received by node A, now that the network has healed.
8. Node A shifts to the gather state and sends out a modified join message with proclist:A,B,C. faillist:B. Such a join message prevents A and B from merging.
9. Node A hits the consensus timeout (while waiting for node C) and sends the join message with proclist:A,B,C. faillist:B,C again.


good analysis

The same thing happens on node B, so A and B will loop forever through steps 7, 8 and 9 (illustrated just below).
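To spell out why step 9 never succeeds (again just an illustration with plain character ids, not totemsrp code): after step 8 node A's faillist is {B}, so the consensus list is proclist - faillist = {A, C}, and the dead node C never sends a join message:

#include <stdio.h>

/* Illustration of step 9: with faillist = {B}, node A waits for C,
 * but C is dead, so the consensus timer always expires. */
int main(void)
{
        const char consensus_list[] = { 'A', 'C' };     /* proclist - faillist */
        const char joins_seen[]     = { 'A', 'B' };     /* C never sends one */
        int i, j, met = 1;

        for (i = 0; i < 2; i++) {
                int found = 0;
                for (j = 0; j < 2; j++)
                        if (joins_seen[j] == consensus_list[i])
                                found = 1;
                if (!found)
                        met = 0;                        /* still waiting for C */
        }
        printf("consensus %s\n",
            met ? "met, form the ring" : "not met: timer expires, faillist grows to B,C again");
        return 0;
}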

If my assumption and analysis are right, then I think step 8 is where the wrong thing happens, because the paper I found at http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.52.4028&rep=rep1&type=pdf says: “if a processor receives a join message in the operational state and if the receiver’s identifier is in the join message’s fail list, … then it ignores the join message.”

Figure 4.4 doesn't match the text.  I've found that in these cases in academic papers, the text takes precedence.

So I created a patch that applies the above rule to try to solve the problem:

--- ./corosync-1.4.6-orig/exec/totemsrp.c Wed May 29 14:33:27 2013 UTC
+++ ./corosync-1.4.6/exec/totemsrp.c Wed Nov 6 13:12:30 2013 UTC
@@ -4274,6 +4274,36 @@
         srp_addr_copy_endian_convert (&out->system_from, &in->system_from);
 }

+static int ignore_join_under_operational (
+        struct totemsrp_instance *instance,
+        const struct memb_join *memb_join)
+{
+        struct srp_addr *proc_list;
+        struct srp_addr *failed_list;
+        unsigned long long ring_seq;
+
+        proc_list = (struct srp_addr *)memb_join->end_of_memb_join;
+        failed_list = proc_list + memb_join->proc_list_entries;
+        ring_seq = memb_join->ring_seq;
+
+        if (memb_set_subset (&instance->my_id, 1,
+                failed_list, memb_join->failed_list_entries)) {
+                return 1;
+        }
+
+        /* In operational state, my_proc_list is exactly the same as
+           my_memb_list. */
+

what is the point of the below code?

+        if ((memb_set_subset (&memb_join->system_from, 1,
+                instance->my_memb_list,
+                instance->my_memb_entries)) &&
+            (ring_seq < instance->my_ring_id.seq)) {
+                return 1;
+        }
+
+        return 0;
+}
+
 static int message_handler_memb_join (
         struct totemsrp_instance *instance,
         const void *msg,
@@ -4304,7 +4334,9 @@
         }
         switch (instance->memb_state) {
         case MEMB_STATE_OPERATIONAL:
-                memb_join_process (instance, memb_join);

I'd write this condition the other way around:

if (ignore_join_under_operational(instance, memb_join) == 0) {

+                if (0 == ignore_join_under_operational(instance, memb_join)) {
+                        memb_join_process (instance, memb_join);
+                }
                 break;

         case MEMB_STATE_GATHER:
Currently, I haven’t reproduced the problem in a 3-node cluster, but I have reproduced the “a processor receives a join message in the operational state and the receiver’s identifier is in the join message’s fail list” circumstance in a two-node environment, using the following steps:
1. iptables -A INPUT -i eth0 -p udp ! --sport domain -j DROP
2. usleep 2126000
3. iptables -D INPUT -i eth0 -p udp ! --sport domain -j DROP

In the two-node environment there is no dead-loop issue as in the 3-node one, because there is no consensus timeout caused by the third, dead node in step 9. But it can still be used to prove the patch.

Please take a look at this issue. Thanks!


Please use git send-email to send the patch.  It allows easier merging of the patch and attribution of the work.

Regards
-steve


--
Yours,
Jason


_______________________________________________
discuss mailing list
discuss@xxxxxxxxxxxx
http://lists.corosync.org/mailman/listinfo/discuss
