Patrick,
Now let's say you really cannot use multicast (which is sadly
highly probable in a cloud environment).
The first thing I totally didn't get is how the whole thing can work
(reliably) without a persistent node list.
There would be a node list, but let's say it's expensive to obtain,
such as having to make a remote call to an external service. So it
can be done, but it can't be used to detect nodes joining and
leaving (polling based, not push).
From: Jan Friesse <jfriesse@xxxxxxxxxx>
Sent: 2014-06-19 09:50:17 EDT
To: Patrick Hemmer <corosync@xxxxxxxxxxxxxxx>, discuss@xxxxxxxxxxxx
Subject: Re: automatic membership discovery
Patrick,
so just to recapitulate your idea: let's say you have a cluster
with 2 nodes, and now you decide to add a third node. Your idea is
to properly configure the 3rd node (so that if we distributed that
config file and called reload on every node, everything would
work), in other words, to add the 3rd node ONLY to the config file
on the 3rd node and then start corosync. The other nodes will just
accept the node and add it to their membership (and probably to
some kind of automatically generated persistent list of nodes). Do
I understand it correctly?
I hadn't considered persistent storage of nodes as a
requirement. But
Ok. Now I'm totally lost. I cannot imagine how this can work
WITHOUT persistent storage of nodes. I mean, let's say you have 5
nodes. If I understood it correctly, their config files will
probably look like:
1st node - 1 node (only itself)
2nd node - 2 nodes (node 1 and node 2)
3rd node - 3 nodes (node 1, node 2, node 3)
...
Then everything is ok. But what if the user decides to have configs
with the following content:
1st node - 1 node (only itself)
2nd node - 2 nodes (node 1 and node 2)
3rd node - 2 nodes (node 2 and node 3)
4th node - 2 nodes (node 3 and node 4)
5th node - 2 nodes (node 4 and node 5)
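So for example (just a sketch, addresses are made up), the
corosync.conf on the 3rd node would then contain only:

    totem {
        version: 2
        transport: udpu
    }

    nodelist {
        node {
            ring0_addr: 10.0.0.2
            nodeid: 2
        }
        node {
            ring0_addr: 10.0.0.3
            nodeid: 3
        }
    }
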
Such a config is perfectly valid, and when starting the nodes in
the correct order you get a 5-node cluster, right? Now let's say
that cluster is stopped. You add an iptables blocking rule so node
1 sees node 2 (but no other nodes), node 2 sees node 1 (but no
other nodes), node 3 sees node 4 (but no other nodes) and node 4
sees node 3 (but no other nodes).
You start nodes 1 - 4 and ... you have TWO perfectly quorate
clusters, right?
Sorry, I think I see where the confusion is coming in. We can set a
rule such that when corosync starts, it either starts with only
itself in the node list (either waiting to be contacted, or for some
external thing to populate it), or it starts with a full nodelist.
if you wanted to persist the discovered
nodes, you could have something
(whether corosync, or an external tool) watch the cmap nodelist,
and
write out to a file when the nodelist changes.
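A rough sketch of what that external watcher could look like
against the cmap API (error handling mostly omitted, only string
keys are dumped, and the output path is just an example):

    /* Mirror the runtime nodelist to a file whenever it changes. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <corosync/cmap.h>

    static cmap_handle_t cmap;

    static void dump_nodelist(void)
    {
        cmap_iter_handle_t iter;
        char key[CMAP_KEYNAME_MAXLEN + 1];
        size_t len;
        cmap_value_types_t type;
        FILE *f = fopen("/var/lib/corosync/nodelist.saved", "w");

        if (f == NULL)
            return;
        cmap_iter_init(cmap, "nodelist.", &iter);
        while (cmap_iter_next(cmap, iter, key, &len, &type) == CS_OK) {
            char *val = NULL;
            /* only string keys in this sketch; real code would also
             * handle the u32 nodeid keys */
            if (cmap_get_string(cmap, key, &val) == CS_OK) {
                fprintf(f, "%s: %s\n", key, val);
                free(val);
            }
        }
        cmap_iter_finalize(cmap, iter);
        fclose(f);
    }

    /* called by cmap_dispatch() on any change under nodelist. */
    static void nodelist_changed(cmap_handle_t h, cmap_track_handle_t th,
        int32_t event, const char *key,
        struct cmap_notify_value new_val,
        struct cmap_notify_value old_val, void *user_data)
    {
        dump_nodelist();
    }

    int main(void)
    {
        cmap_track_handle_t track;

        cmap_initialize(&cmap);
        cmap_track_add(cmap, "nodelist.",
            CMAP_TRACK_ADD | CMAP_TRACK_DELETE |
            CMAP_TRACK_MODIFY | CMAP_TRACK_PREFIX,
            nodelist_changed, NULL, &track);
        dump_nodelist();        /* initial snapshot */
        cmap_dispatch(cmap, CS_DISPATCH_BLOCKING);
        cmap_finalize(cmap);
        return 0;
    }

It only needs to link against libcmap, so it could live entirely
outside corosync.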
Honestly, I would rather talk about the use case/higher level
rather than a concrete implementation. I mean, a concrete
implementation makes us focus on one way, but there may be other
ways. So things like storing/not storing nodes, ...
I didn't consider it a requirement as I
considered the possible
scenarios that would result in a split brain to be near
impossible. For
example, if you just have a config file where the only node is
itself,
when it comes up, it could be made such that it doesn't get
consensus
until it can contact another node. When it does, that other node
would
share the quorum info, and perhaps even the nodelist. In the
event that
any number of nodes fail, the last_man_standing behavior will
keep the
cluster from going split brain (a node will only be removed from
the
nodelist if it leaves gracefully, or the cluster maintains quorum
for the duration of the LMS window).
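For reference, the votequorum config I have in mind here is roughly
this (the window value is just the 10 second default):

    quorum {
        provider: corosync_votequorum
        allow_downscale: 1
        last_man_standing: 1
        last_man_standing_window: 10000
    }
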
Basically the only scenario I can think of that could result in
a split
brain is if 2 nodes shut down without a persistent nodelist, and
then
started back up and were somehow told about each other, but not
the rest
of the cluster.
In fact persistent storage might even be a problem. If a node
goes down,
and while it's down another node leaves the cluster, when it
comes back
up, it won't know that node is gone. Though you could solve this
by
obtaining the nodelist from the rest of the cluster (if the rest
of the
cluster is quorate).
Basically:
* A node cannot be quorate by itself.
This is a bad requirement. A one-node cluster is weird (but still
used), but a 2-node cluster (which is actually probably the most
used scenario) where one of the nodes is in maintenance mode is
perfectly ok. Such a cluster HAS to be quorate.
This wouldn't apply to all usages of corosync. This would be an
operational mode corosync can be in, like two-node is.
* Corosync will add any node that contacts
it to its own node list.
* Upon join, the side of the cluster that is quorate will send
its
quorum information (expected votes, current votes, etc) to the
inquorate
side.
This is already happening.
Except for things like downscaling.
* (uncertain) If quorate, corosync may
share the nodelist with the rest
of the cluster (the new node learns existing nodes &
existing nodes
learn the new node without it contacting them).
* If a node leaves gracefully, it will be removed from the
nodelist.
* If a node leaves ungracefully, it will be removed if the
cluster
remains quorate for the duration of the last man standing window.
Because if so, I believe it would also mean changing the config
file, simply to keep them in sync. And honestly, keeping the config
file is for sure a way I would like to go, but that way is very
hard. Every single thing must be very well defined (like what is
synchronized and what is not).
Yes, I wouldn't consider removing the config file. Though one
possibility might be keeping the node list separate from the
config
file, and letting corosync update that.
As simple as the idea is, it may indeed be that this isn't the
direction
corosync should go. Traditionally corosync has been geared more
towards
static clusters that don't change often. But with cloud
computing
Corosync is not designed for static clusters. Actually it's pretty
much the opposite. UDPU is the newer mode. The original UDP mode
(which still exists, is still supported and is still the default)
handles dynamic clusters very well (as long as the HW is able to do
multicast).
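With multicast the totem section doesn't need a nodelist at all,
something like (addresses just an example):

    totem {
        version: 2
        interface {
            ringnumber: 0
            bindnetaddr: 10.0.0.0
            mcastaddr: 239.255.1.1
            mcastport: 5405
        }
    }

Any node configured with the same mcastaddr simply joins.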
I don't see how you can argue this. The allow_downscale feature is
very new, and is still not supported
(https://github.com/corosync/corosync/blob/master/man/votequorum.5#L301).
becoming so prevalent, the need for
dynamic clusters is growing very
rapidly. There are several other projects which are implementing
this
functionality, such as etcd
(https://github.com/coreos/etcd/blob/master/Documentation/design/cluster-finding.md)
and consul
(http://www.consul.io/intro/getting-started/join.html). But
these other services tend to be key/value stores, utilize a very
heavy
protocol (such as http), and don't offer a CPG type service.
First, keep in mind that all RAFT based protocols (so both etcd and
consul) need quorum. Corosync itself doesn't need it (+ pacemaker
can also work in a "without quorum" mode). In other words, when
RAFT loses quorum, the whole cluster is dead and manual
intervention is needed. This is in STRICT opposition to the
last-man-standing behavior. I see this as a bigger blocker than
tcp/http/cpg (actually, cpg is pretty easily implementable using a
key/value store).
So the question is, why can't you use multicast?
You nailed it earlier, cloud networks don't allow multicast or
broadcast. I've even worked for companies whose network admins
don't allow it either.
The other question is, did you try multicast? If so, is the
multicast behavior something you would like to achieve with UDPU?
Mostly. It seems like it's on the right track, but downscaling is
problematic.
Regards,
Honza
Patrick Hemmer wrote:
From: Patrick Hemmer
<corosync@xxxxxxxxxxxxxxx>
Sent: 2014-06-16 11:25:40 EDT
To: Jan Friesse <jfriesse@xxxxxxxxxx>,
discuss@xxxxxxxxxxxx
Subject: Re: automatic membership discovery
On 2014/06/16 11:25, Patrick Hemmer wrote:
Patrick,
I'm interested in having corosync
automatically accept members into the
cluster without manual reconfiguration. Meaning that
when I bring a new
node online, I want to configure it for the existing
nodes, and those
nodes will automatically add the new node into their
nodelist.
From a purely technical standpoint, this doesn't seem
like it would be
hard to do. The only 2 things you have to do to add a
node are add the
nodelist.node.X.nodeid and ring0_addr to cmap. When the
new node comes
up, it starts sending out messages to the existing
nodes. The ring0_addr
can be discovered from the source address, and the
nodeid is in the message.
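As a sketch of what that amounts to through the cmap API (the
nodelist index, nodeid and address are hardcoded here, but they
would really come from the joining node's message):

    #include <stdio.h>
    #include <corosync/cmap.h>

    int main(void)
    {
        cmap_handle_t cmap;

        if (cmap_initialize(&cmap) != CS_OK) {
            fprintf(stderr, "can't connect to corosync\n");
            return 1;
        }

        /* "2" is assumed to be the next free nodelist index; real
         * code would find it by walking the existing
         * nodelist.node.X. keys */
        cmap_set_uint32(cmap, "nodelist.node.2.nodeid", 3);
        cmap_set_string(cmap, "nodelist.node.2.ring0_addr", "10.0.0.3");

        cmap_finalize(cmap);
        return 0;
    }

(I believe corosync-cmapctl -s can do the same from the shell for
testing.)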
I need to think about this a little deeper. It sounds like
it may work,
but I'm not entirely sure.
Going even further, when using the
allow_downscale and last_man_standing
features, we can automatically remove nodes from the
cluster when they
disappear. With last_man_standing, the quorum expected
votes is
automatically adjusted when a node is lost, so it makes
no difference
whether the node is offline, or removed. Then with the
auto-join
functionality, it'll automatically be added back in when
it
re-establishes communication.
It might then even be possible to write the cmap data
out to a file when
a node joins or leaves. This way if corosync restarts,
and the
corosync.conf hasn't been updated, the nodelist can be
read from this
save. If the save is out of date, and some nodes are
unreachable, they
would simply be removed, and added when they join.
This wouldn't even have to be a part of corosync. Could
have some
external utility watch the cmap values, and take care of
setting them
when corosync is launched.
Ultimately this allows us to have a large scale
dynamically sized
cluster without having to edit the config of every node
each time a node
joins or leaves.
Actually, this is exactly what pcs does.
Unfortunately pcs has lots of issues.
1. It assumes you will be using pacemaker as well.
In some of our uses, we are using corosync without
pacemaker.
2. It still has *lots* of bugs. Even more once you start
trying to use
non-fedora based distros.
Some bugs have been open on the project for a year and
a half.
3. It doesn't know the real address of its own host.
What I mean is the case where a node is sitting behind NAT. We
plan on running
corosync inside a docker container, and the container
goes through
NAT if it needs to talk to another host. So pcs would
need to know
the NAT address to advertise it to the other hosts.
With the method
described here, that address is automatically
discovered.
4. Doesn't handle automatic cleanup.
If you remove a node, something has to go and clean
that node up.
Basically you would have to write a program to connect
to the quorum
service and monitor for nodes going down, and then
remove them. But
then what happens if that node was only temporarily
down? Who is
responsible for adding it back into the cluster? If the
node that
was down is responsible for adding itself back in, what
if another
node joined the cluster while it was down? Its list
will be
incomplete. You could do a few things to try and
alleviate these
headaches, but automatic membership just feels more
like the right
solution.
5. It doesn't allow you to adjust the config file.
This really doesn't sound like it
would be hard to do. I might even be
willing to attempt implementing it myself if this sounds
like something
that would be acceptable to merge into the code base.
Thoughts?
Yes, but the question is if it is really worth it. I mean:
- With multicast you have FULLY dynamic membership
- pcs is able to distribute the config file, so adding a new node
to a UDPU cluster is easy
Do you see any use case where pcs or multicast doesn't work? (To
clarify, I'm not criticizing your idea (actually I find it
interesting), but I'm trying to find a real killer use case for
this feature, whose implementation will almost surely take quite a
lot of time.)
Aside from the pcs issues mentioned above, having this in
corosync just
feels like the right solution. No external processes
involved, no
additional lines of communication, real-time on-demand
updating. The end
goal might be able to be accomplished by modifying pcs to
resolve the
issues, but is that the right way? If people want to use
crmsh over pcs,
do they not get this functionality?
Regards,
Honza
-Patrick
_______________________________________________
discuss mailing list
discuss@xxxxxxxxxxxx
http://lists.corosync.org/mailman/listinfo/discuss