14.01.2012 11:57, Fabio M. Di Nitto wrote:
> On 01/14/2012 09:09 AM, Vladislav Bogdanov wrote:
>> Hi,
>>
>> 13.01.2012 21:21, Fabio M. Di Nitto wrote:
>> [snip]
>>>> + expected_votes is removed and instead auto-calculated based upon
>>>> quorum_votes in the node list
>>
>> Is it possible to dynamically keep track of "seen" nodes here and use
>> only those nodes for the expected_votes calculation?
>>
>> I even have a use case for that:
>> I "run" a cluster consisting of at most 17 nodes with UDPU, so all
>> nodes are listed in the config. Currently only 3 nodes are powered
>> on, because I do not yet have load that requires more (and power is
>> expensive in European datacenters). When load increases I just power
>> on additional nodes and the quorum expectations are recalculated
>> automagically. I have that implemented with corosync + pacemaker
>> right now: pacemaker keeps that list of nodes and does the quorum
>> calculations correctly. And I'm absolutely happy with that. From what
>> I can see, the changes being discussed will break my happiness.
>
> Yes and no. Let me explain:
>
> votequorum already does that internally. For example:
>
> expected_votes: 3 in corosync.conf
>
> You power on your 4th node (assuming everybody votes 1, to keep this
> example simple) and expected_votes is automatically bumped to 4 on
> all nodes.
>
> While this is what you are asking for, there are a few corner cases
> where this could lead to a dangerous situation.
>
> First of all, the new expected_votes is not written to disk but only
> retained internally in votequorum.

Actually, I'd prefer it to be written to disk together with that "seen"
list, so the cluster knows who should be there even after a full
restart. But I like neither the idea of having it in the config nor of
calculating it from a nodelist. From my point of view it is not a
configuration variable but rather a "state" one, and it should be
managed in a stateful way (saved to disk).
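
To make this concrete, the piece I am talking about is just the
votequorum stanza (syntax from memory, as I understand the votequorum
configuration; values as in Fabio's example above):

    quorum {
        provider: corosync_votequorum
        expected_votes: 3
    }

It is that expected_votes line that I would rather see live as state
under /var/lib than in the config file.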

> This approach does not protect you against partitions properly,
> especially at startup of the cluster. For example, out of 16 nodes,
> 8 are on switch A and 8 on switch B. The interlink between the
> switches is broken. All nodes know expected_votes: 3 from
> corosync.conf.

If expected_votes lived not in the config but in some file under
/var/lib (corosync already does the same for ring state), and were
managed dynamically cluster-wide, that should be impossible (provided,
of course, that the admin didn't delete that file on all nodes). The
cluster knows it has 16 active nodes (it even knows every member it
has ever "seen"), so each 8-node partition falls short of the 9 votes
that a majority of 16 requires, and neither side becomes quorate.

> Both partitions of the cluster can achieve quorate status, and they
> can create chaos: fencing each other, data corruption and all. Now,
> we agree that this is generally an admin error (not noticing that the
> interlink is down), but it leaves a window open for disasters.
>
> On the other side, I am not going to force users to do it
> differently. The current votequorum implementation allows this use
> case, and I am not going to enforce otherwise. Users should still be
> aware of what they are asking for, though.
>
>> It would also be great if I were able to forcibly remove an inactive
>> node from that "seen" list with just one command on *one* cluster
>> node. The use case for that is human error, when the wrong node is
>> powered on by mistake.
>
> The "seen" list within the quorum module is dynamic. As soon as you
> shut down a node (cleanly or otherwise) and totem notices that the
> node has gone away, that node is removed from the quorum calculation.

Ugh? Do you mean that the dynamic version of expected_votes is
decremented automatically?

> Your concern is probably related to the discussed nodelist, but
> that's up to others to decide "how" to handle addition/removal of
> nodes. It doesn't affect the quorum module at all.

Generally yes, it is about the nodelist, the vote list and
expected_votes in a config file.

All I wanted to say is that I'm pretty happy with how pacemaker
implements quorum management (from an admin's point of view). If I
power on more "unseen" nodes, expected_votes is automatically
incremented and saved into the CIB. If I then power those nodes down,
their votes are still counted until I remove them from the CIB and
decrement expected_votes manually (actually, that part didn't fully
work the last time I checked). And I do not like the idea of touching
the configuration file every time I want to add a node to the cluster,
then redistributing that config to all nodes, and then reloading it on
every node.

Right now I have all 17 nodes listed in corosync.conf (UDPU), but my
expected_votes in the pacemaker CIB is 3. That's why Steve's idea of
calculating expected_votes from a vote list would be a regression for
me.

Vladislav
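
P.S. To put some flesh on that last paragraph, this is roughly what my
setup looks like today (corosync 1.x UDPU syntax from memory, trimmed
to three of the seventeen member entries; addresses are illustrative):

    totem {
        version: 2
        transport: udpu
        interface {
            ringnumber: 0
            bindnetaddr: 192.168.1.0
            member {
                memberaddr: 192.168.1.1
            }
            member {
                memberaddr: 192.168.1.2
            }
            member {
                memberaddr: 192.168.1.3
            }
            # ...and so on up to 192.168.1.17
        }
    }

while the quorum expectation lives in the pacemaker CIB as the
expected-quorum-votes cluster property, adjustable by hand with
something along the lines of:

    crm_attribute --type crm_config --name expected-quorum-votes --update 3

If expected_votes were instead derived from the member/vote list, it
would become 17, and my 3 running nodes (needing 9 votes for a
majority) could never be quorate.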