Re: [RFC] quorum module configuration bits

On 01/10/2012 11:47 PM, Andrew Beekhof wrote:
> On Tue, Jan 10, 2012 at 9:08 PM, Fabio M. Di Nitto <fdinitto@xxxxxxxxxx> wrote:
>> Hi all,
>>
>> in some recent discussions, the issue came up of how to configure the
>> quorum module. As I don't really have a complete solution yet, I need to
>> seek advice from the community :)
>>
>> Problem:
>>
>> it would be very nice if corosync.conf could simply be scp'ed/copied
>> between nodes and everything worked as expected on all nodes.
>> The issue is that some quorum bits are, at this point in time, node
>> specific, which means that to alter some values it is necessary to edit
>> corosync.conf on the specific node.
>> On top of that, it would be nice if expected_votes could be
>> automatically calculated from the votes: values.
>>
>> The current quorum configuration (based on topic-quorum patches):
>>
>> quorum {
>>    provider: corosync_votequorum
>>    expected_votes: 8
>>    votes: 1
>>    two_node: 0
>>    wait_for_all: 0
>>    last_man_standing: 0
>>    auto_tie_breaker: 0
>> }
>>
>> totem {
>>    nodeid: xxx
>> }
>>
>> The 2 values that cannot be copied around are quorum.votes and totem.nodeid.
>>
>> In the current votequorum/totem incarnation, votes/expected_votes/nodeid
>> are all broadcast to all nodes, so each node that joins the cluster
>> becomes aware of the other peers' values.
>>
>> As a consequence of the current config format, the auto_tie_breaker
>> feature requires wait_for_all in order to work (so that the complete
>> list of nodeids is available; see the auto_tie_breaker implementation in
>> the topic-quorum branch for details).
>>
>> Honza and I quickly explored options to add those values to the node
>> list of udpu, but that's limiting because it doesn't work well with
>> multicast and/or broadcast, and it has integration issues with RRP.
>>
>> Also, adding lists to quorum {} involves a certain amount of duplicated
>> information.
>>
>> For example:
>>
>> quorum {
>>   nodeid_list: x y z...
>>   node.x.votes: ..
>>   node.y.votes: ..
>> }
>>
>> That, IMHO, is anything but nice to look at.
>>
>> So the question of changing the config format also raises the following
>> questions:
>>
>> 1) do we really need to support an auto_tie_breaker feature without
>> wait_for_all? If NO, then we don't need the list of nodeids upfront.
>>
>> 2) do we really care about votes other than 1?
> 
> That was also my question when reading the above.
> It always struck me as troublesome to get right; just giving one of 4
> nodes an extra vote (for example) will still give you a tie under the
> wrong conditions.
> 
> It seems (to me) like a habit people got into when clusters went to
> pieces without quorum, and we have "better" solutions today (like
> the token registry).
> So my vote is to drop it.

That was my take too in the beginning, but apparently there are some use
cases that require votes != 1.

> 
>> If NO, then votes: can
>> simply be dropped from the corosync.conf defaults, and where an override
>> is necessary, it can be done specifically for that node. This solution
>> poses the problem that expected_votes needs to be set in corosync.conf
>> (one line in the config file vs. several) but it might be slightly
>> trickier to calculate if votes are not balanced.
> 
> Any chance the value could be incremented based on the number of nodes
> ever seen?
> I.e. if count(active peers) > expected_votes, update the config file.

expected_votes is already calculated that way. If you configure 8 but
all of a sudden you see 9 nodes, then expected_votes is incremented.
The same is true if one node starts voting differently (1 -> X):
expected_votes is then updated across the cluster automagically.
Writing to the file is an unnecessary operation with the current
votequorum incarnation.
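
To make that concrete, here is a minimal sketch of the behaviour
described above (the names effective_expected and struct member are made
up for illustration, this is not the actual votequorum source): every
member broadcasts its votes, and if the sum of the votes seen exceeds the
configured expected_votes, the effective value grows in memory; nothing
has to be written back to corosync.conf.

#include <stdio.h>

struct member {
        unsigned int nodeid;
        unsigned int votes;
};

static unsigned int effective_expected(unsigned int configured,
                                       const struct member *m,
                                       unsigned int count)
{
        unsigned int total = 0, i;

        for (i = 0; i < count; i++)
                total += m[i].votes;    /* values broadcast by the peers */

        /* in this sketch the value only grows */
        return (total > configured) ? total : configured;
}

int main(void)
{
        /* corosync.conf said expected_votes: 8, but 9 nodes show up */
        struct member m[9];
        unsigned int i;

        for (i = 0; i < 9; i++) {
                m[i].nodeid = i + 1;
                m[i].votes = 1;
        }

        /* prints 9 */
        printf("effective expected_votes: %u\n", effective_expected(8, m, 9));
        return 0;
}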


> 
> That way most people could simply ignore the setting until they wanted
> to remove a node.

Not that simple, no.

There are several cases where expected_votes is required to be known
upfront, especially when handling partitions and startups.

Let's say you have an 8-node cluster; quorum is expected to be 5.

Now the switch connecting 4 of the nodes to the other 4 is dead or
malfunctioning. By using an incremental expected_votes, you can
effectively start 2 clusters.

Both clusters would be quorate, with expected_votes set to 4 and quorum
set to 3. There is no guarantee those will ever merge. I doubt we want
this situation to ever exist.

Also, it would break the wait_for_all feature (or WFA would need to
require expected_votes .. either way).

Fabio
_______________________________________________
discuss mailing list
discuss@xxxxxxxxxxxx
http://lists.corosync.org/mailman/listinfo/discuss


