Re: automatic membership discovery

Patrick Hemmer <corosync@xxxxxxxxxxxxxxx> · Thu, 19 Jun 2014 23:18:38 -0400

    From: Jan Friesse <jfriesse@xxxxxxxxxx>

      Sent: 2014-06-19 09:50:17 EDT
      To: Patrick Hemmer <corosync@xxxxxxxxxxxxxxx>, discuss@xxxxxxxxxxxx
      Subject: Re: automatic membership discovery

      Patrick,
so just to recapitulate your idea: let's say you have a cluster with 2
nodes, and you decide to add a third node. Your idea is to properly
configure only the 3rd node (with a config file that would work if we
distributed it and called reload on every node); in other words, add the
3rd node ONLY to the config file on the 3rd node and then start corosync.
The other nodes will just accept the node and add it to their membership
(and probably to some kind of automatically generated persistent list of
nodes). Do I understand it correctly?

    I hadn't considered a persistent storage of nodes as a requirement.
    But if you wanted to persist the discovered nodes, you could have
    something (whether corosync, or an external tool) watch the cmap
    nodelist, and write out to a file when the nodelist changes.
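
A minimal sketch of the persistence step such a watcher might perform. It assumes the nodelist arrives as text in the `key (type) = value` form that corosync-cmapctl prints; the function names and the JSON file format are hypothetical, chosen only for illustration.

```python
import json
import re
from pathlib import Path

# Matches lines such as: nodelist.node.0.ring0_addr (str) = 10.0.0.1
_LINE = re.compile(r"^nodelist\.node\.(\d+)\.(\w+) \((\w+)\) = (.*)$")


def parse_nodelist(dump: str) -> dict:
    """Collect nodelist.node.X.* keys from a cmap dump into {index: {field: value}}."""
    nodes: dict = {}
    for line in dump.splitlines():
        match = _LINE.match(line.strip())
        if match is None:
            continue  # not a nodelist key, ignore it
        index, field, ctype, value = match.groups()
        # Integer types (u32, i16, ...) are stored as ints, everything else as text.
        nodes.setdefault(index, {})[field] = int(value) if ctype[0] in "ui" else value
    return nodes


def save_if_changed(nodes: dict, path: Path) -> bool:
    """Write the nodelist as JSON only when it differs from what is on disk."""
    text = json.dumps(nodes, indent=2, sort_keys=True)
    if path.exists() and path.read_text() == text:
        return False
    path.write_text(text)
    return True
```

The watcher would call `parse_nodelist` on a fresh dump each time it is notified of a change, and `save_if_changed` keeps the file untouched when nothing in the nodelist actually moved.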

    I didn't consider it a requirement as I considered the possible
    scenarios that would result in a split brain to be near impossible.
    For example, if you just have a config file where the only node is
    itself, when it comes up, it could be made such that it doesn't get
    consensus until it can contact another node. When it does, that
    other node would share the quorum info, and perhaps even the
    nodelist. In the event that any number of nodes fail, the
    last_man_standing behavior will keep the cluster from going split
    brain (a node will only be removed from the nodelist if it leaves
    gracefully, or if the cluster maintains quorum for the duration of
    the LMS window).

    Basically the only scenario I can think of that could result in a
    split brain is if 2 nodes shut down without a persistent nodelist,
    and then started back up and were somehow told about each other, but
    not the rest of the cluster.

    In fact persistent storage might even be a problem. If a node goes
    down, and while it's down another node leaves the cluster, when it
    comes back up, it won't know that node is gone. Though you could
    solve this by obtaining the nodelist from the rest of the cluster
    (if the rest of the cluster is quorate).

    Basically:

    * A node cannot be quorate by itself.

    * Corosync will add any node that contacts it to its own node list.

    * Upon join, the side of the cluster that is quorate will send its
    quorum information (expected votes, current votes, etc) to the
    inquorate side.

    * (uncertain) If quorate, corosync may share the nodelist with the
    rest of the cluster (the new node learns existing nodes &
    existing nodes learn the new node without it contacting them).

    * If a node leaves gracefully, it will be removed from the nodelist.

    * If a node leaves ungracefully, it will be removed if the cluster
    remains quorate for the duration of the last_man_standing window.
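
The rules above can be sketched as a toy model of one node's view. This is not corosync's actual data model: the class, its method names, and the 10-second window are assumptions for illustration, every node is assumed to carry one vote, and the sharing of quorum information and of the nodelist between nodes is left out.

```python
from dataclasses import dataclass, field


@dataclass
class NodeView:
    """One node's view of the cluster under the proposed rules (toy model)."""
    nodeid: int
    lms_window: float = 10.0  # hypothetical last_man_standing window, in seconds
    nodelist: set = field(default_factory=set)
    online: set = field(default_factory=set)

    def __post_init__(self):
        # A node always knows about itself.
        self.nodelist.add(self.nodeid)
        self.online.add(self.nodeid)

    @property
    def quorate(self) -> bool:
        # A node cannot be quorate by itself; otherwise a strict majority is needed.
        return len(self.online) >= 2 and len(self.online) > len(self.nodelist) // 2

    def on_contact(self, peer: int) -> None:
        # Any node that contacts us is added to our own nodelist.
        self.nodelist.add(peer)
        self.online.add(peer)

    def on_graceful_leave(self, peer: int) -> None:
        # A graceful leave removes the node from the nodelist immediately.
        self.online.discard(peer)
        self.nodelist.discard(peer)

    def on_failure(self, peer: int) -> None:
        # An ungraceful leave only marks the node offline; it stays in the nodelist.
        self.online.discard(peer)

    def on_lms_window_elapsed(self, quorate_for: float) -> None:
        # Failed nodes are dropped only if we stayed quorate for the whole window.
        if self.quorate and quorate_for >= self.lms_window:
            self.nodelist &= self.online
```

In this model a node that has lost contact with the majority never shrinks its nodelist, which is the property that keeps a minority partition from declaring itself a complete cluster.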

Because if so, I believe it would also mean changing the config file,
simply to keep them in sync. And honestly, keeping the config file is for
sure a way I would like to go, but that way is very hard. Every single
thing must be very well defined (like what is synchronized and what is not).

    Yes, I wouldn't consider removing the config file. Though one
    possibility might be keeping the node list separate from the config
    file, and letting corosync update that.

    As simple as the idea is, it may indeed be that this isn't the
    direction corosync should go. Traditionally corosync has been geared
    more towards static clusters that don't change often. But with cloud
    computing becoming so prevalent, the need for dynamic clusters is
    growing very rapidly. There are several other projects which are
    implementing this functionality, such as etcd
    (https://github.com/coreos/etcd/blob/master/Documentation/design/cluster-finding.md)
    and consul (http://www.consul.io/intro/getting-started/join.html).
    But these other services tend to be key/value stores, utilize a very
    heavy protocol (such as http), and don't offer a CPG type service.

      Regards,
  Honza

Patrick Hemmer wrote:

        From: Patrick Hemmer <corosync@xxxxxxxxxxxxxxx>
Sent: 2014-06-16 11:25:40 EDT
To: Jan Friesse <jfriesse@xxxxxxxxxx>, discuss@xxxxxxxxxxxx
Subject: Re:  automatic membership discovery

On 2014/06/16 11:25, Patrick Hemmer wrote:

          Patrick,

            I'm interested in having corosync automatically accept members into the
cluster without manual reconfiguration. Meaning that when I bring a new
node online, I want to configure it for the existing nodes, and those
nodes will automatically add the new node into their nodelist.
From a purely technical standpoint, this doesn't seem like it would be
hard to do. The only 2 things you have to do to add a node are add the
nodelist.node.X.nodeid and ring0_addr to cmap. When the new node comes
up, it starts sending out messages to the existing nodes. The ring0_addr
can be discovered from the source address, and the nodeid is in the message.
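
A minimal sketch of that step, assuming the receiving node already has the sender's nodeid (taken from the message) and the source address it observed. The function name and the in-memory nodelist layout are hypothetical; only the `nodelist.node.X.nodeid` and `nodelist.node.X.ring0_addr` key names come from the description above.

```python
def nodelist_keys_for(nodelist: dict, nodeid: int, source_addr: str) -> dict:
    """Return the cmap keys to set for a newly seen node, or {} if it is already known.

    `nodelist` maps the slot index X to {"nodeid": ..., "ring0_addr": ...}.
    """
    if any(entry["nodeid"] == nodeid for entry in nodelist.values()):
        return {}
    slot = max(nodelist, default=-1) + 1  # next free nodelist.node.X index
    return {
        f"nodelist.node.{slot}.nodeid": nodeid,
        f"nodelist.node.{slot}.ring0_addr": source_addr,
    }
```

Because `source_addr` is the address as seen by the receiver, the new node never has to know or advertise its own externally visible address.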

          I need to think about this a little deeper. It sounds like it may work,
but I'm not entirely sure.

            Going even further, when using the allow_downscale and last_man_standing
features, we can automatically remove nodes from the cluster when they
disappear. With last_man_standing, the quorum expected votes is
automatically adjusted when a node is lost, so it makes no difference
whether the node is offline, or removed. Then with the auto-join
functionality, it'll automatically be added back in when it
re-establishes communication.
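
The arithmetic behind this can be sketched as follows. It is a simplified model, assuming one vote per node and a quorum threshold of a strict majority of the expected votes; the function names are hypothetical.

```python
def quorum(expected_votes: int) -> int:
    """Votes needed for quorum: a strict majority of expected_votes."""
    return expected_votes // 2 + 1


def expected_after_window(expected_votes: int, live_votes: int) -> int:
    """Lower expected_votes to the live count, but only if the survivors are quorate."""
    if live_votes >= quorum(expected_votes):
        return live_votes
    return expected_votes
```

Starting from 5 nodes, losing one leaves 4 live votes against a quorum of 3, so the expected votes drop to 4. Losing three at once leaves 2 live votes, which is below quorum, so nothing is adjusted and the survivors stay inquorate.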

It might then even be possible to write the cmap data out to a file when
a node joins or leaves. This way if corosync restarts, and the
corosync.conf hasn't been updated, the nodelist can be read from this
save. If the save is out of date, and some nodes are unreachable, they
would simply be removed, and added when they join.
This wouldn't even have to be a part of corosync. Could have some
external utility watch the cmap values, and take care of setting them
when corosync is launched.
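
A minimal sketch of the restore side of such a utility. It assumes the saved nodelist is a mapping of slot index to fields and that values are pushed back with `corosync-cmapctl -s <key> <type> <value>`; the function name and the saved layout are hypothetical, and the commands are only built here, not executed.

```python
def restore_commands(nodes: dict) -> list:
    """Build one corosync-cmapctl argv per saved nodelist field, in slot order."""
    types = {"nodeid": "u32", "ring0_addr": "str"}  # cmap value type per field
    commands = []
    for index in sorted(nodes, key=int):
        for field_name, ctype in types.items():
            if field_name in nodes[index]:
                key = f"nodelist.node.{index}.{field_name}"
                value = str(nodes[index][field_name])
                commands.append(["corosync-cmapctl", "-s", key, ctype, value])
    return commands
```

Each returned list could be handed to `subprocess.run` once corosync is up; nodes from a stale save that never answer would then age out as described above.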

Ultimately this allows us to have a large scale dynamically sized
cluster without having to edit the config of every node each time a node
joins or leaves.

          Actually, this is exactly what pcs does.

        Unfortunately pcs has lots of issues.

 1. It assumes you will be using pacemaker as well.
    In some of our uses, we are using corosync without pacemaker.

 2. It still has *lots* of bugs. Even more once you start trying to use
    non-fedora based distros.
    Some bugs have been open on the project for a year and a half.

 3. It doesn't know the real address of its own host.
    What I mean is when a node is sitting behind NAT. We plan on running
    corosync inside a docker container, and the container goes through
    NAT if it needs to talk to another host. So pcs would need to know
    the NAT address to advertise it to the other hosts. With the method
    described here, that address is automatically discovered.

 4. Doesn't handle automatic cleanup.
    If you remove a node, something has to go and clean that node up.
    Basically you would have to write a program to connect to the quorum
    service and monitor for nodes going down, and then remove them. But
    then what happens if that node was only temporarily down? Who is
    responsible for adding it back into the cluster? If the node that
    was down is responsible for adding itself back in, what if another
    node joined the cluster while it was down? Its list will be
    incomplete. You could do a few things to try and alleviate these
    headaches, but automatic membership just feels more like the right
    solution.

 5. It doesn't allow you to adjust the config file.

            This really doesn't sound like it would be hard to do. I might even be
willing to attempt implementing it myself if this sounds like something
that would be acceptable to merge into the code base.
Thoughts?

          Yes, but the question is whether it is really worth it. I mean:
- With multicast you have FULLY dynamic membership
- PCS is able to distribute the config file, so adding a new node to a UDPU
cluster is easy

Do you see any use case where pcs or multicast doesn't work? (To
clarify: I'm not criticizing your idea (actually I find it interesting), but
I'm trying to find a real killer use case for this feature, whose
implementation will almost certainly take quite a lot of time.)

        Aside from the pcs issues mentioned above, having this in corosync just
feels like the right solution. No external processes involved, no
additional lines of communication, real-time on-demand updating. The end
goal might be able to be accomplished by modifying pcs to resolve the
issues, but is that the right way? If people want to use crmsh over pcs,
do they not get this functionality?

          Regards,
  Honza

            -Patrick

_______________________________________________
discuss mailing list
discuss@xxxxxxxxxxxx
http://lists.corosync.org/mailman/listinfo/discuss
