Patrick,
Now let's say you really cannot use multicast (which is sadly
highly probable in a cloud environment).
The first thing I totally didn't get is how the whole thing can work
(reliably) without a persistent node list.
There would be a node list, but let's say it's expensive to obtain,
such as having to make a remote call to an external service. So it
can be done, but it can't be used to detect nodes joining and
leaving (polling based, not push).
From: Jan Friesse <jfriesse@xxxxxxxxxx>
Sent: 2014-06-19 09:50:17 EDT
To: Patrick Hemmer <corosync@xxxxxxxxxxxxxxx>, discuss@xxxxxxxxxxxx
Subject: Re: automatic membership discovery
Patrick,
so just to recapitulate your idea: let's say you have a cluster
with 2 nodes, and now you decide to add a third node. Your idea is
to properly configure the 3rd node (so that if we distributed that
config file and called reload on every node, everything would
work), in other words, to add the 3rd node ONLY to the config file
on the 3rd node and then start corosync. The other nodes will just
accept the node and add it to their membership (and probably to
some kind of automatically generated persistent list of nodes). Do
I understand it correctly?
I hadn't considered persistent storage of nodes as a
requirement. But
Ok. Now I'm totally lost. I cannot imagine how this can work
WITHOUT persistent storage of nodes. I mean, let's say you have 5
nodes. If I understood it correctly, their config files will
probably look like:
1st node - 1 node (only itself)
2nd node - 2 nodes (node 1 and node 2)
3rd node - 3 nodes (node 1, node 2, node 3)
...
Then everything is ok. But what if the user decides to have configs
with the following content:
1st node - 1 node (only itself)
2nd node - 2 nodes (node 1 and node 2)
3rd node - 2 nodes (node 2 and node 3)
4th node - 2 nodes (node 3 and node 4)
5th node - 2 nodes (node 4 and node 5)
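So for example (just a sketch, addresses are made up), the
corosync.conf on the 3rd node would then contain only:

    totem {
        version: 2
        transport: udpu
    }

    nodelist {
        node {
            ring0_addr: 10.0.0.2
            nodeid: 2
        }
        node {
            ring0_addr: 10.0.0.3
            nodeid: 3
        }
    }
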
Such a config is perfectly valid, and when starting the nodes in
the correct order you get a 5-node cluster, right? Now let's say
that cluster is stopped. You add an iptables blocking rule so node
1 sees node 2 (but no other nodes), node 2 sees node 1 (but no
other nodes), node 3 sees node 4 (but no other nodes) and node 4
sees node 3 (but no other nodes).
You start nodes 1 - 4 and ... you have TWO perfectly quorate
clusters, right?
Sorry, I think I see where the confusion is coming in. We can set a
rule such that when corosync starts, it either starts with only
itself in the node list (either waiting to be contacted, or for some
external thing to populate it), or it starts with a full nodelist.
if you wanted to persist the discovered
nodes, you could have something
(whether corosync, or an external tool) watch the cmap nodelist,
and
write out to a file when the nodelist changes.
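A rough sketch of what that external watcher could look like
against the cmap API (error handling mostly omitted, only string
keys are dumped, and the output path is just an example):

    /* Mirror the runtime nodelist to a file whenever it changes. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <corosync/cmap.h>

    static cmap_handle_t cmap;

    static void dump_nodelist(void)
    {
        cmap_iter_handle_t iter;
        char key[CMAP_KEYNAME_MAXLEN + 1];
        size_t len;
        cmap_value_types_t type;
        FILE *f = fopen("/var/lib/corosync/nodelist.saved", "w");

        if (f == NULL)
            return;
        cmap_iter_init(cmap, "nodelist.", &iter);
        while (cmap_iter_next(cmap, iter, key, &len, &type) == CS_OK) {
            char *val = NULL;
            /* only string keys in this sketch; real code would also
             * handle the u32 nodeid keys */
            if (cmap_get_string(cmap, key, &val) == CS_OK) {
                fprintf(f, "%s: %s\n", key, val);
                free(val);
            }
        }
        cmap_iter_finalize(cmap, iter);
        fclose(f);
    }

    /* called by cmap_dispatch() on any change under nodelist. */
    static void nodelist_changed(cmap_handle_t h, cmap_track_handle_t th,
        int32_t event, const char *key,
        struct cmap_notify_value new_val,
        struct cmap_notify_value old_val, void *user_data)
    {
        dump_nodelist();
    }

    int main(void)
    {
        cmap_track_handle_t track;

        cmap_initialize(&cmap);
        cmap_track_add(cmap, "nodelist.",
            CMAP_TRACK_ADD | CMAP_TRACK_DELETE |
            CMAP_TRACK_MODIFY | CMAP_TRACK_PREFIX,
            nodelist_changed, NULL, &track);
        dump_nodelist();        /* initial snapshot */
        cmap_dispatch(cmap, CS_DISPATCH_BLOCKING);
        cmap_finalize(cmap);
        return 0;
    }

It only needs to link against libcmap, so it could live entirely
outside corosync.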
Honestly, I would rather talk about the use case/higher level
rather than a concrete implementation. I mean, a concrete
implementation makes us focus on one way, but there may be other
ways. So things like storing/not storing nodes, ...
I didn't consider it a requirement as I
considered the possible
scenarios that would result in a split brain to be near
impossible. For
example, if you just have a config file where the only node is
itself,
when it comes up, it could be made such that it doesn't get
consensus
until it can contact another node. When it does, that other node
would
share the quorum info, and perhaps even the nodelist. In the
event that
any number of nodes fail, the last_man_standing behavior will
keep the
cluster from going split brain (a node will only be removed from
the
nodelist if it leaves gracefully, or the cluster maintains quorum
for the duration of the LMS window).
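For reference, the votequorum config I have in mind here is roughly
this (the window value is just the 10 second default):

    quorum {
        provider: corosync_votequorum
        allow_downscale: 1
        last_man_standing: 1
        last_man_standing_window: 10000
    }
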
Basically the only scenario I can think of that could result in
a split
brain is if 2 nodes shut down without a persistent nodelist, and
then
started back up and were somehow told about each other, but not
the rest
of the cluster.
In fact persistent storage might even be a problem. If a node
goes down,
and while it's down another node leaves the cluster, when it
comes back
up, it won't know that node is gone. Though you could solve this
by
obtaining the nodelist from the rest of the cluster (if the rest
of the
cluster is quorate).
Basically:
* A node cannot be quorate by itself.
This is a bad requirement. A one-node cluster is weird (but still
used), but a 2-node cluster (which is actually probably the most
used scenario) where one of the nodes is in maintenance mode is
perfectly ok. Such a cluster HAS to be quorate.
This wouldn't apply to all usages of corosync. This would be an
operational mode corosync can be in, like two-node is.
* Corosync will add any node that contacts
it to its own node list.
* Upon join, the side of the cluster that is quorate will send
its
quorum information (expected votes, current votes, etc) to the
inquorate
side.
This is already happening.
Except for things like downscaling.
* (uncertain) If quorate, corosync may
share the nodelist with the rest
of the cluster (the new node learns existing nodes &
existing nodes
learn the new node without it contacting them).
* If a node leaves gracefully, it will be removed from the
nodelist.
* If a node leaves ungracefully, it will be removed if the
cluster
remains quorate for the duration of the last man standing window.
Because if so, I believe it would also mean changing the config
file, simply to keep them in sync. And honestly, keeping the config
file is for sure a way I would like to go, but that way is very
hard. Every single thing must be very well defined (like what is
synchronized and what is not).
Yes, I wouldn't consider removing the config file. Though one
possibility might be keeping the node list separate from the
config
file, and letting corosync update that.
As simple as the idea is, it may indeed be that this isn't the
direction
corosync should go. Traditionally corosync has been geared more
towards
static clusters that don't change often. But with cloud
computing
Corosync is not designed for static clusters. Actually it's pretty
much the opposite. UDPU is the newer mode. The original UDP mode
(which still exists, is still supported and is still the default)
handles dynamic clusters very well (as long as the HW is able to do
multicast).
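With multicast the totem section doesn't need a nodelist at all,
something like (addresses just an example):

    totem {
        version: 2
        interface {
            ringnumber: 0
            bindnetaddr: 10.0.0.0
            mcastaddr: 239.255.1.1
            mcastport: 5405
        }
    }

Any node configured with the same mcastaddr simply joins.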
I don't see how you can argue this. The allow_downscale feature is
very new, and is still not supported
(https://github.com/corosync/corosync/blob/master/man/votequorum.5#L301).
becoming so prevalent, the need for
dynamic clusters is growing very
rapidly. There are several other projects which are implementing
this
functionality, such as etcd
(https://github.com/coreos/etcd/blob/master/Documentation/design/cluster-finding.md)
and consul
(http://www.consul.io/intro/getting-started/join.html). But
these other services tend to be key/value stores, utilize a very
heavy
protocol (such as http), and don't offer a CPG type service.
First, keep in mind that all RAFT based protocols (so both etcd and
consul) need quorum. Corosync itself doesn't need it (+ pacemaker
can also work in a "without quorum" mode). In other words, when
RAFT loses quorum, the whole cluster is dead and manual
intervention is needed. This is in STRICT opposition to the
last-man-standing behavior. I see this as a bigger blocker than
tcp/http/cpg (actually, cpg is pretty easily implementable using a
key/value store).
So the question is, why can't you use multicast?
You nailed it earlier, cloud networks don't allow multicast or
broadcast. I've even worked for companies whose network admins
don't allow it either.
The other question is, did you try multicast? If so, is the
multicast behavior something you would like to achieve with UDPU?
Mostly. It seems like it's on the right track, but downscaling is
problematic.
Regards,
Honza
Patrick Hemmer wrote:
From: Patrick Hemmer
<corosync@xxxxxxxxxxxxxxx>
Sent: 2014-06-16 11:25:40 EDT
To: Jan Friesse <jfriesse@xxxxxxxxxx>,
discuss@xxxxxxxxxxxx
Subject: Re: automatic membership discovery
On 2014/06/16 11:25, Patrick Hemmer wrote:
Patrick,
I'm interested in having corosync
automatically accept members into the
cluster without manual reconfiguration. Meaning that
when I bring a new
node online, I want to configure it for the existing
nodes, and those
nodes will automatically add the new node into their
nodelist.
From a purely technical standpoint, this doesn't seem
like it would be
hard to do. The only 2 things you have to do to add a
node are add the
nodelist.node.X.nodeid and ring0_addr to cmap. When the
new node comes
up, it starts sending out messages to the existing
nodes. The ring0_addr
can be discovered from the source address, and the
nodeid is in the message.
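As a sketch of what that amounts to through the cmap API (the
nodelist index, nodeid and address are hardcoded here, but they
would really come from the joining node's message):

    #include <stdio.h>
    #include <corosync/cmap.h>

    int main(void)
    {
        cmap_handle_t cmap;

        if (cmap_initialize(&cmap) != CS_OK) {
            fprintf(stderr, "can't connect to corosync\n");
            return 1;
        }

        /* "2" is assumed to be the next free nodelist index; real
         * code would find it by walking the existing
         * nodelist.node.X. keys */
        cmap_set_uint32(cmap, "nodelist.node.2.nodeid", 3);
        cmap_set_string(cmap, "nodelist.node.2.ring0_addr", "10.0.0.3");

        cmap_finalize(cmap);
        return 0;
    }

(I believe corosync-cmapctl -s can do the same from the shell for
testing.)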
I need to think about this a little deeper. It sounds like
it may work,
but I'm not entirely sure.
Going even further, when using the
allow_downscale and last_man_standing
features, we can automatically remove nodes from the
cluster when they
disappear. With last_man_standing, the quorum expected
votes is
automatically adjusted when a node is lost, so it makes
no difference
whether the node is offline, or removed. Then with the
auto-join
functionality, it'll automatically be added back in when
it
re-establishes communication.
It might then even be possible to write the cmap data
out to a file when
a node joins or leaves. This way if corosync restarts,
and the
corosync.conf hasn't been updated, the nodelist can be
read from this
save. If the save is out of date, and some nodes are
unreachable, they
would simply be removed, and added when they join.
This wouldn't even have to be a part of corosync. Could
have some
external utility watch the cmap values, and take care of
setting them
when corosync is launched.
Ultimately this allows us to have a large scale
dynamically sized
cluster without having to edit the config of every node
each time a node
joins or leaves.
Actually, this is exactly what pcs does.
Unfortunately pcs has lots of issues.
1. It assumes you will be using pacemaker as well.
In some of our uses, we are using corosync without
pacemaker.
2. It still has *lots* of bugs. Even more once you start
trying to use
non-fedora based distros.
Some bugs have been open on the project for a year and
a half.
3. It doesn't know the real address of its own host.
What I mean is the case where a node is sitting behind NAT. We
plan on running
corosync inside a docker container, and the container
goes through
NAT if it needs to talk to another host. So pcs would
need to know
the NAT address to advertise it to the other hosts.
With the method
described here, that address is automatically
discovered.
4. Doesn't handle automatic cleanup.
If you remove a node, something has to go and clean
that node up.
Basically you would have to write a program to connect
to the quorum
service and monitor for nodes going down, and then
remove them. But
then what happens if that node was only temporarily
down? Who is
responsible for adding it back into the cluster? If the
node that
was down is responsible for adding itself back in, what
if another
node joined the cluster while it was down? Its list
will be
incomplete. You could do a few things to try and
alleviate these
headaches, but automatic membership just feels more
like the right
solution.
5. It doesn't allow you to adjust the config file.
This really doesn't sound like it
would be hard to do. I might even be
willing to attempt implementing it myself if this sounds
like something
that would be acceptable to merge into the code base.
Thoughts?
Yes, but the question is if it is really worth it. I mean:
- With multicast you have FULLY dynamic membership
- pcs is able to distribute the config file, so adding a new node
to a UDPU cluster is easy
Do you see any use case where pcs or multicast doesn't work? (To
clarify, I'm not criticizing your idea (actually I find it
interesting), but I'm trying to find a real killer use case for
this feature, whose implementation will almost surely take quite a
lot of time.)
Aside from the pcs issues mentioned above, having this in
corosync just
feels like the right solution. No external processes
involved, no
additional lines of communication, real-time on-demand
updating. The end
goal might be able to be accomplished by modifying pcs to
resolve the
issues, but is that the right way? If people want to use
crmsh over pcs,
do they not get this functionality?
Regards,
Honza
-Patrick
_______________________________________________
discuss mailing list
discuss@xxxxxxxxxxxx
http://lists.corosync.org/mailman/listinfo/discuss