Re: [RFC] quorum module configuration bits

On Wed, Jan 11, 2012 at 6:49 PM, Fabio M. Di Nitto <fdinitto@xxxxxxxxxx> wrote:
> On 1/11/2012 7:41 AM, Andrew Beekhof wrote:
>> On Wed, Jan 11, 2012 at 4:50 PM, Fabio M. Di Nitto <fdinitto@xxxxxxxxxx> wrote:
>>> On 01/10/2012 11:47 PM, Andrew Beekhof wrote:
>>>> On Tue, Jan 10, 2012 at 9:08 PM, Fabio M. Di Nitto <fdinitto@xxxxxxxxxx> wrote:
>>>>> Hi all,
>>>>>
>>>>> in some recent discussions, the issue came up of how to configure the
>>>>> quorum module. As I don't really have a complete solution yet, I need to
>>>>> seek advice from the community :)
>>>>>
>>>>> Problem:
>>>>>
>>>>> it would be very nice if corosync.conf could simply be scp'ed/copied
>>>>> between nodes and everything would work as expected on all nodes.
>>>>> The issue is that some quorum bits are, at this point in time,
>>>>> node-specific, which means that to alter some values it is necessary
>>>>> to edit corosync.conf on the specific node.
>>>>> On top of that, it would be nice if expected_votes could be
>>>>> automatically calculated based on votes: values.
>>>>>
>>>>> The current quorum configuration (based on topic-quorum patches):
>>>>>
>>>>> quorum {
>>>>>    provider: corosync_votequorum
>>>>>    expected_votes: 8
>>>>>    votes: 1
>>>>>    two_node: 0
>>>>>    wait_for_all: 0
>>>>>    last_man_standing: 0
>>>>>    auto_tie_breaker: 0
>>>>> }
>>>>>
>>>>> totem {
>>>>>    nodeid: xxx
>>>>> }
>>>>>
>>>>> The 2 values that cannot be copied around are quorum.votes and totem.nodeid.
>>>>>
>>>>> In the current votequorum/totem incarnation, votes/expected_votes/nodeid
>>>>> are all broadcast to all nodes, so each node that joins the cluster
>>>>> becomes aware of the other peers' values.
>>>>>
>>>>> As a consequence of the current config format, the auto_tie_breaker
>>>>> feature requires wait_for_all to work (in order to have the complete
>>>>> list of nodeids; see the auto_tie_breaker implementation in the
>>>>> topic-quorum branch for details).
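
(For those following at home: the way I read the above, with the current
format the tie breaker can only be enabled together with wait_for_all, so a
working configuration would look roughly like the snippet below. The option
names are the ones from Fabio's example; the values are just an illustration
on my part.)

quorum {
    provider: corosync_votequorum
    expected_votes: 8
    wait_for_all: 1
    auto_tie_breaker: 1
}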
>>>>>
>>>>> Honza and I quickly explored options to add those values into the node
>>>>> list of udpu, but that's limiting because it doesn't work well in
>>>>> multicast and/or broadcast and it has integration issues with RRP.
>>>>>
>>>>> Also adding lists to quorum {} involves a certain level of duplicated
>>>>> information.
>>>>>
>>>>> For example:
>>>>>
>>>>> quorum {
>>>>>   nodeid_list: x y z...
>>>>>   node.x.votes: ..
>>>>>   node.y.votes: ..
>>>>> }
>>>>>
>>>>> which IMHO is anything but nice to look at.
>>>>>
>>>>> So changing the config format also raises the following
>>>>> questions:
>>>>>
>>>>> 1) do we really need to support an auto_tie_breaker feature without
>>>>> wait_for_all? If NO, then we don't need the list of nodeids upfront.
>>>>>
>>>>> 2) do we really care about votes other than 1?
>>>>
>>>> That was also my question when reading the above.
>>>> It always struck me as troublesome to get right: just giving one of 4
>>>> nodes an extra vote (for example) will still give you a tie under the
>>>> wrong conditions.
>>>>
>>>> Seems (to me) like a habit people got into back when clusters went to
>>>> pieces without quorum, and we have "better" solutions for that today
>>>> (like the token registry).
>>>> So my vote is drop it.
>>>
>>> That was my take too in the beginning but apparently there are some use
>>> cases that require votes != 1.
>>
>> Can someone enumerate a couple?  Maybe they're valid, maybe they're not.
>
> Lon/David need to pitch in here. Lon gave me an example with some magic
> numbers, but I keep forgetting to write it down.
>
>>
>>>>> If NO, then votes: can
>>>>> simply be dropped from the corosync.conf defaults, and when an override
>>>>> is necessary it can be done specifically on that node. This solution
>>>>> poses the problem that expected_votes needs to be set in corosync.conf
>>>>> (one line in the config file vs. several), and it might be slightly
>>>>> more tricky to calculate if votes are not balanced.
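
(If I read that right, the corosync.conf you copy to every node would then
carry only the shared bits, roughly:

quorum {
    provider: corosync_votequorum
    expected_votes: 8
}

and on the one node that needs a weight other than 1 you would add a votes:
line to that block by hand, e.g. votes: 2. The numbers here are only meant
as an illustration of the idea, not as part of the proposal.)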
>>>>
>>>> Any chance the value could be incremented based on the number of nodes
>>>> ever seen?
>>>> I.e. if count(active peers) > expected_votes, update the config file.
>>>
>>> expected_votes is already calculated that way. If you configure 8 but
>>> all of a sudden you see 9 nodes, then expected_votes is incremented.
>>> The same is true if one node starts voting differently (1 -> X):
>>> expected_votes is updated across the cluster automagically.
>>> Writing to the file is an unnecessary operation with votequorum's
>>> current incarnation.
>>
>> I'm not sure about that.
>> If it was 3 and got bumped to 5 at runtime, then two of the original 3
>> could come back up thinking they have quorum (at the same time the
>> remaining 3 legitimately retain quorum).
>>
>> Or am I missing something?
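
(Spelling out the arithmetic in that example: the two returning nodes still
carry expected_votes=3 from the stale config, so they compute quorum =
3/2 + 1 = 2 and, seeing each other, consider themselves quorate; the
remaining three nodes run with expected_votes=5, so their quorum is
5/2 + 1 = 3 and they are quorate as well.)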
>
> I would expect admins to update corosync.conf as node counts increase,
> but the automatic increase is there as a fail-safe.
>
> At the same time, when a node joins a running cluster, even if it has
> expected_votes set to 1, it would receive the highest expected_votes value in
> the cluster from the other nodes.
>
> Yes, it doesn't protect against stupid user errors where the expected
> votes are never increased, or against that partition case. That would make
> "write to the config file" a good thing, but I doubt corosync has that
> option right now.
>
>>
>>>
>>>
>>>>
>>>> That way most people could simply ignore the setting until they wanted
>>>> to remove a node.
>>>
>>> Not that simple, no.
>>>
>>> There are several cases where expected_votes is required to be known
>>> upfront, especially when handling partitions and startups.
>>>
>>> Let's say you have an 8-node cluster; quorum is expected to be 5.
>>
>> Err. Why would you ever do that?  And wouldn't the above logic bump it
>> to 8 at runtime?
>
> Uh? 8 / 2 + 1 = 5

For those playing along at home, I thought Fabio was saying expected_votes=5,
not that quorum was reached at 5.

>
> If I expect 8 nodes, 1 vote each, quorum is 5. expected_votes != quorum.
>
> expected_votes is the highest number of votes in the cluster.
>
>>
>>> The switch between one set of 4 nodes and the other 4 is dead or
>>> malfunctioning. By using an incremental expected_votes, you can
>>> effectively start 2 clusters.
>>
>> You can, but you'd probably stop after the 5th node didn't join the first four.
>> Because if you're writing the highest value back to corosync.conf, then
>> the only time you could hit this situation is on first cluster boot
>
> Right, assuming you write that value back to corosync.conf, I agree, but
> that also implies that you have seen all cluster nodes up at once at
> least once.
>
> In the end, I think it's a lot safer to just know expected_votes upfront,
> and a lot less complicated for the user when bringing the cluster up.
>
>> (and you don't bring up all members of a brand new cluster all at
>> once).
>
> Ehhh, we can't assume that. Customers do that, and we have seen bugs
> related to this condition.

Again, for those at home, I was talking about a cluster that had just
been installed and had not been previously started. Ever.
Fabio was talking about a cluster that was fully stopped but had been
started at some point in the past.

>
>>
>>> Both clusters would be quorate, with expected_votes set to 4 and quorum
>>> to 3. There is no guarantee those will ever merge. I doubt we want this
>>> situation to ever exist.
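
(Spelling out the formula from earlier in the thread: each 4-node partition
would compute quorum = 4/2 + 1 = 3 and hold 4 votes, so both halves consider
themselves quorate.)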
>>>
>>> Also, it would break the wait_for_all feature (or WFA would need to
>>> require expected_votes... either way).
>>
>> Again, it only affects the first time you bring up the cluster.
>> After that, expected_votes would have been (auto) set correctly and
>> wait_for_all would work as expected.
>>
>
> wait_for_all is only useful when you bring the cluster up for the very
> first time... the two options conflict.

Same as before.
I was talking about a cluster that had just been installed and had not
been previously started; Fabio was talking about a cluster that was
fully stopped.
_______________________________________________
discuss mailing list
discuss@xxxxxxxxxxxx
http://lists.corosync.org/mailman/listinfo/discuss


