Re: [RFC] quorum module configuration bits

On 1/11/2012 7:41 AM, Andrew Beekhof wrote:
> On Wed, Jan 11, 2012 at 4:50 PM, Fabio M. Di Nitto <fdinitto@xxxxxxxxxx> wrote:
>> On 01/10/2012 11:47 PM, Andrew Beekhof wrote:
>>> On Tue, Jan 10, 2012 at 9:08 PM, Fabio M. Di Nitto <fdinitto@xxxxxxxxxx> wrote:
>>>> Hi all,
>>>>
>>>> in some recent discussions, the issue came up of how to configure the
>>>> quorum module. As I don't really have a complete solution yet, I need to
>>>> seek advice from the community :)
>>>>
>>>> Problem:
>>>>
>>>> it would be very nice if corosync.conf could simply be scp'ed/copied
>>>> between nodes and everything worked as expected on all nodes.
>>>> The issue is that some quorum bits are, at this point in time, node
>>>> specific, which means that to alter some values it is necessary to edit
>>>> corosync.conf on the specific node.
>>>> On top of that, it would be nice if expected_votes could be
>>>> automatically calculated based on the votes: values.
>>>>
>>>> The current quorum configuration (based on topic-quorum patches):
>>>>
>>>> quorum {
>>>>    provider: corosync_votequorum
>>>>    expected_votes: 8
>>>>    votes: 1
>>>>    two_node: 0
>>>>    wait_for_all: 0
>>>>    last_man_standing: 0
>>>>    auto_tie_breaker: 0
>>>> }
>>>>
>>>> totem {
>>>>    nodeid: xxx
>>>> }
>>>>
>>>> The 2 values that cannot be copied around are quorum.votes and totem.nodeid.
>>>>
>>>> In the current votequorum/totem incarnation, votes/expected_votes/nodeid
>>>> are all broadcast to all nodes, so each node that joins the cluster
>>>> becomes aware of the other peers' values.
>>>>
>>>> As a consequence of the current config format, the auto_tie_breaker
>>>> feature requires wait_for_all in order to work (it needs the complete
>>>> list of nodeids; see the auto_tie_breaker implementation in the
>>>> topic-quorum branch for details).
>>>>
>>>> Honza and I quickly explored options to add those values to the node
>>>> list of udpu, but that's limiting because it doesn't work well with
>>>> multicast and/or broadcast and it has integration issues with RRP.
>>>>
>>>> Also, adding lists to quorum {} involves a certain amount of duplicated
>>>> information.
>>>>
>>>> For example:
>>>>
>>>> quorum {
>>>>   nodeid_list: x y z...
>>>>   node.x.votes: ..
>>>>   node.y.votes: ..
>>>> }
>>>>
>>>> which IMHO is anything but nice to look at.
>>>>
>>>> So changing the config format also raises the following
>>>> questions:
>>>>
>>>> 1) do we really need to support the auto_tie_breaker feature without
>>>> wait_for_all? If NO, then we don't need the list of nodeids upfront.
>>>>
>>>> 2) do we really care about votes other than 1?
>>>
>>> That was also my question when reading the above.
>>> It always struck me as troublesome to get right; just giving one of 4
>>> nodes an extra vote (for example) will still give you a tie under the
>>> wrong conditions.
>>>
>>> Seems (to me) like a habit people got into when clusters went to
>>> pieces without quorum, and we have "better" solutions for that today
>>> (like the token registry).
>>> So my vote is drop it.
>>
>> That was my take too in the beginning but apparently there are some use
>> cases that require votes != 1.
> 
> Can someone enumerate a couple?  Maybe they're valid, maybe they're not.

Lon/David need to pitch in here. Lon gave me an example with some magic
numbers, but I keep forgetting to write it down.

> 
>>>> If NO, then votes: can
>>>> simply be dropped from the corosync.conf defaults, and if an override
>>>> is necessary, it can be done on the specific node. This solution poses
>>>> the problem that expected_votes needs to be set in corosync.conf (a
>>>> one-liner in the config file vs. different per-node lines), but it might
>>>> be slightly trickier to calculate if votes are not balanced.
>>>
>>> Any chance the value could be incremented based on the number of nodes
>>> ever seen?
>>> I.e. if count(active peers) > expected_votes, update the config file.
>>
>> expected_votes is already calculated that way. If you configure 8 but
>> all of a sudden you see 9 nodes, then expected_votes is incremented.
>> The same is true if one node starts voting differently (1 -> X):
>> expected_votes is updated across the cluster automagically.
>> Writing to file is an unnecessary operation with the current votequorum
>> incarnation.
> 
> I'm not sure about that.
> If it was 3 and got bumped to 5 at runtime, then two of the original 3
> could come back up thinking they have quorum (at the same time the
> remaining 3 legitimately retain quorum).
> 
> Or am I missing something?

I would expect admins to update corosync.conf as the node count increases,
but the automatic increase is there as a fail-safe.

At the same time, when a node joins a running cluster, even if it has
expected_votes set to 1, it will receive the highest expected_votes value
in the cluster from the other nodes.
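
A rough sketch of that idea (my own illustration, not the actual
votequorum code):

/*
 * Illustration only: when quorum information arrives from another node,
 * adopt the highest expected_votes value seen so far, so a node that
 * joins with expected_votes: 1 gets pulled up to the cluster-wide value.
 */
static unsigned int our_expected_votes = 1; /* from local corosync.conf */

static void merge_expected_votes(unsigned int received_expected_votes)
{
        if (received_expected_votes > our_expected_votes) {
                our_expected_votes = received_expected_votes;
        }
}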

Yes, it doesn't protect against the stupid user error of not increasing
expected_votes, and against that partition case. That would make
"write to config file" a good thing, but I doubt corosync has that
option right now.

> 
>>
>>
>>>
>>> That way most people could simply ignore the setting until they wanted
>>> to remove a node.
>>
>> Not that simple, no.
>>
>> There are several cases where expected_votes is required to be known
>> upfront, especially when handling partitions and startups.
>>
>> Let's say you have an 8-node cluster; quorum is expected to be 5.
> 
> Err. Why would you ever do that?  And wouldn't the above logic bump it
> to 8 at runtime?

Uh? 8 / 2 + 1 = 5.

If I expect 8 nodes with 1 vote each, quorum is 5. expected_votes != quorum.

expected_votes is the highest total number of votes known to the cluster.
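
In other words (a minimal sketch of the arithmetic, not the votequorum
source):

/* Illustration only: simple-majority quorum derived from expected_votes. */
static unsigned int calculate_quorum(unsigned int expected_votes)
{
        return (expected_votes / 2) + 1; /* 8 -> 5, 4 -> 3 */
}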

> 
>> The switch between one set of 4 nodes and the other 4 nodes is dead or
>> malfunctioning. By using an incremental expected_votes, you can
>> effectively start 2 clusters.
> 
> You can, but you'd probably stop after the 5th node didn't join the first four.
> Because if you're writing the highest value back to corosync.conf then
> the only time you could hit this situation is on first cluster boot

Right, assuming you write that value back to corosync.conf, I agree, but
that also implies that you have seen all cluster nodes up at once at
least one time.

In the end, I think it's a lot safer to just know expected_votes upfront,
and a lot less complicated for the user to bring the cluster up.
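
To spell out the 8-node split above with my own numbers: if expected_votes
is only learned incrementally, each half computes

  partition A: 4 nodes seen -> expected_votes = 4 -> quorum = 4/2 + 1 = 3 -> quorate
  partition B: 4 nodes seen -> expected_votes = 4 -> quorum = 4/2 + 1 = 3 -> quorate

whereas with expected_votes known upfront (8), quorum stays at 5 and
neither 4-node partition can become quorate on its own.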

> (and you don't bring up all members of a brand new cluster all at
> once).

Ehhh, we can't assume that. Customers do that, and we have seen bugs
related to this condition.

> 
>> Both clusters would be quorate, with expected_votes set to 4 and quorum
>> to 3. There is no guarantee those will merge. I doubt we want this
>> situation to ever exist.
>>
>> Also, it would break the wait_for_all feature (or WFA would need to
>> require expected_votes... either way).
> 
> Again, it only affects the first time you bring up the cluster.
> After that, expected_votes would have been (auto) set correctly and
> wait_for_all would work as expected.
> 

wait_for_all is only useful when you bring the cluster up for the very
first time... the two options conflict.

Fabio
_______________________________________________
discuss mailing list
discuss@xxxxxxxxxxxx
http://lists.corosync.org/mailman/listinfo/discuss