On 2/10/2012 6:17 AM, Vladislav Bogdanov wrote:
> 10.02.2012 07:10, Fabio M. Di Nitto wrote:
>> On 02/09/2012 09:07 PM, Vladislav Bogdanov wrote:
>>> Hi Fabio,
>>>
>>> 09.02.2012 18:47, Fabio M. Di Nitto wrote:
>>>> Hi Vladislav,
>>>>
>>>> On 1/27/2012 10:46 PM, Vladislav Bogdanov wrote:
>>>>> 26.01.2012 15:41, Fabio M. Di Nitto wrote:
>>>>>> On 1/26/2012 1:15 PM, Vladislav Bogdanov wrote:
>>>>>>
>>>>>>>>>> Probably even not lower than the number of votes from nodes which are now either active or inactive but joined at least once (I suppose that the nodelist is fully editable at runtime, so an admin may somehow reset the join count of a node and only then reduce expected_votes).
>>>>>>>>
>>>>>>>> I have been thinking about this some more, but I am not sure I grasp the use case or what kind of protection you are trying to suggest.
>>>>>>>>
>>>>>>>> Reducing the number of expected_votes is an admin action; it's not very different from removing a node from the "seen" list manually and recalculating expected_votes.
>>>>>>>>
>>>>>>>> Can you clarify it for me?
>>>>>>>
>>>>>>> Imagine (this case is a little bit hypothetical, but anyway):
>>>>>>> * You have a cluster with 8 active nodes, and you (for some historical reason or due to admin fault/laziness) have expected_votes set to 3 (ok, you had a 3-node cluster not so long ago, but added more nodes because of growing load).
>>>>>>> * The cluster splits 5+3 due to loss of communication between switches (or switch stacks).
>>>>>>> * 3 nodes are fenced.
>>>>>>> * The partition with the majority continues operation.
>>>>>>> * The 3 fenced nodes boot back and form a *quorate* partition because they have expected_votes set to 3.
>>>>>>> * Data is corrupted.
>>>>>>>
>>>>>>> If the fenced nodes knew right after boot that the cluster consists of 8 active nodes, they would not override the expected_votes obtained from the persistent "seen" list with the lower value from the config, and the data would be safe.
>>>>>>
>>>>>> Oh great, yes I see where you are going here. It sounds like an interesting approach, but that clearly requires a file in which to store that information.
>>>>>
>>>>> I do not see a big problem here...
>>>>> Corosync saves its ring persistently anyway.
>>>>>
>>>>>>
>>>>>> There is still a window where the file containing the expected_votes from the "seen" list is corrupted, though. At that point it's difficult to detect which of the two values is correct, and it doesn't prevent the issue at all if the file is removed entirely (even by accident), but as a first shot I would say that it is better than nothing.
>>>>>
>>>>> Hopefully at least not all nodes from a fenced partition will have it corrupted/deleted. They should honor the maximum ev value among them all.
>>>>>
>>>>>>
>>>>>> I'll have a test and see how it pans out, but at a first glance I think we should only store the last known expected_votes while quorate. The node booting would use the higher of the two values. If the cluster has decreased in size in the meantime, the node joining would be informed about it (just sent a patch to the list about it 10 minutes ago ;))
>>>>
>>>> So I am 99% done with this patch, saving the highest expected_votes and so on, but there is a corner case I am not entirely sure how to handle.
>>>>
>>>> Let's take an example.
>>>>
>>>> 8-node cluster (each node votes 1 for simplicity)
>>>> expected_votes set to 3
>>>>
>>>> 3 nodes are happily running and all...
>>>>
>>>> increase to 8 nodes
>>>>
>>>> new expected_votes is 8 (and we remember this by writing it to a file).
>>>>
>>>> we scale back to 3 nodes at this point.
>>>
>>> This is a little bit unclear to me.
>>
>> Maybe I didn't explain it properly. Let me try again.
>>
>>> According to your latest work, I suppose you mean that 5 nodes are just cleanly shut down, and the cluster reduces expected votes and quorum accordingly.
>>>
>>> I do not have a strong PoV on the leave_remove feature yet. On the one hand it is handy. On the other, at the very least it is dangerous, and the corner case you talk about highlights this. After several hours of brainstorming I do not see any clean solution for this case, except to not allow automatic expected votes decrease at all.
>>
>> leave_remove requires a perfectly clean node shutdown to work; otherwise ev is not recalculated. A node starts to shut down, sends a message to the other cluster nodes that it is leaving, and the other nodes "downscale". How this happens is irrelevant to this problem, and dangerous it is not: it's something cman has had for ages and it worked pretty well.
>
> I am still not convinced and would prefer manual deletion...

Sure, but leave_remove is never enabled by default. It's a user choice, like enabling highest_seen_tracking. Nothing says they need to be used in combo. It just makes my test easier by removing nodes automatically instead of doing it manually.

>
>>
>> The point was to reproduce your original use case:
>>
>> Start with 3, scale up to 17 and then go back to 3.
>>
>> Once you are back to 3, highest_ev is 17 (for now I haven't allowed hev downscale/override yet, and that needs fixing for other use cases).
>>
>> The process you used to go back to 3 is irrelevant (either manual or via leave_remove). The final result we want is to avoid any of the shut-down nodes gaining quorum in a partition.
>
> The main goal is to avoid data corruption (prevent two quorate partitions), I think. Stability is a little bit less important here.

Yep, that's why we try to implement those barriers. We are on the same page regarding goals here.

>
>>
>>>
>>> One possibility just came to mind two seconds ago: what if we just do not allow expected_votes to go below the quorum based on higher_ever_seen? That would help a lot, although it introduces logic that is not very clean for everyone. It is just a raw idea, without any logical background. Does it solve the problem? Comments are welcome.
>>>
>>> I mean, if you have higher_ever_seen:8, then expected_votes (runtime) should not go below 8/2+1=5.
>>> Of course this will raise a handful of reports from users unless it is documented IN CAPS with !!!!!!!!dozens of exclamation marks!!!!!!!! (more than one time ;) ).
>>
>> Hmmm, that is an interesting approach, yes, but it is indeed rather confusing for the end user.
>
> That is what I said above ;)
>
>>
>> "Yes you can start with 3, scale up to N, but you can't go below quorum(N)..."
>
> Unless you do manual intervention.

Clearly. Manual intervention is always there, but we need to work out what can be done automagically.

>
>>
>>>
>>> So, to set it lower one needs to somehow edit higher_ever_seen.
>>>
>>> Frankly speaking, my previous suggestion (keep a persistent list of cluster members) is still valid for this. And I still really like it.
>>> The admin would just say: "Hey, node6 is no longer a part of the cluster, please delete it from everywhere." The node is removed from cmap and then higher_ever_seen is recalculated automatically. Without this, the admin needs to "calculate" that value, and even this simple arithmetic can be error-prone in some circumstances (time pressure, a sleepless night at work, etc.).
>>
>> I still haven't integrated the highest_ever_seen calculation with the nodelist (though it's easy) or the "downgrading" of highest_ever_seen.
>>
>> The persistent list doesn't help me at all in this case. highest_ever_seen can only increase at this point in time, and eventually it can be downgraded manually (or via nodelist editing).
>
> It would help to avoid a mess when node votes are not the same for all nodes. I'm sure that I will make a mistake when I need to recalculate something in such a heterogeneous cluster manually. But if I can just say "ok, node X is not supposed to be active any longer, please delete it from any calculations", then the chance of a mistake is lower by an order of magnitude.

Ok, let's try to recap this for one second, because I see this from the votequorum internal calculation/code perspective and you from the end-user point of view (that is good, so we can find gaps ;)).

My understanding is that:

You have an N-node cluster, where votes are not even.
At some point in time you shut down node X (which votes something).
That nodeX is marked "DEAD" in the node list and it stops voting.
nodeX's votes are still used to calculate expected_votes.

Now you want to tell votequorum that nodeX is gone and recalculate expected_votes.

You have two options:

1) temporarily remove the node from the calculation:

corosync-quorumtool -e $value

where $value can be anything below the current votes (just enter 1 to make your life simpler); this will pull expected_votes down to the current node votes. Given that expected_votes can never be lower than total_votes in the current cluster, votequorum will do the calculation for you correctly.

2) remove the node forever from the nodelist and votequorum will pick it up automatically:

$somecorosync-tool-magic-i-dont-know-the-syntax

The same property of expected_votes applies here and it will be recalculated for you.

highest_seen_votes can then be lowered to the current expected_votes in this case, since it is an admin request to lower or change everything.

Either way, internally, I don't need to exchange the list of seen nodes, because either the nodelist from corosync.conf _or_ the calculation request will tell me what to do.

>
>>
>>>
>>> The most important point here is that a (still) possible split-brain is caused not by a software decision but by the admin's action. You understand what that means for support (and for judges in the worst case ;) ).
>>
>> Right, we are on the same page here.
>>
>> In my example we can protect users against being "stupid" up to quorum(highest_expected_votes), basically. Can we do better than that?
>
> I wouldn't say we can do anything better.
>
>>
>> So if you have
>> 17 nodes,
>> hev is 17
>> quorum(hev) = 9
>>
>> The admin can "safely look good" by powering on up to 8 nodes by mistake, but if he fires up 9, then the new quorate partition will fence the old one that is running services.
>
> Only if you have startup fencing enabled. Otherwise you end up with data corruption again.
> And even with startup fencing enabled you'll get a fencing war after the old partition reboots back.
>
> I really doubt we can easily avoid this.
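
Going back to the two options I recapped above, here is a rough sketch of what the admin would actually type (node6 is the node from your example; the nodeid, vote value and config layout are made up for illustration, and I am leaving out the runtime cmap tool since I don't remember its syntax):

Option 1, temporarily pull expected_votes down at runtime:

    # votequorum clamps expected_votes to the total votes of the nodes
    # currently in the cluster, so "1" is just a convenient lower bound
    corosync-quorumtool -e 1

Option 2, remove the node for good by deleting its entry from the nodelist in corosync.conf (assuming a nodelist that looks roughly like this):

    nodelist {
        node {
            ring0_addr: node6    # delete this whole node {} block
            nodeid: 6
            quorum_votes: 2
        }
        # ... the remaining node {} entries stay as they are ...
    }

In both cases expected_votes is recalculated for you, quorum follows as expected_votes/2+1, and with the change we are discussing highest_ever_seen could be lowered at the same time, since it is an explicit admin request.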
Ok, I guess we are heading to the same conclusion: this is pretty much the "don't execute rm -rf /" case. We can only protect users up to a certain point; if they want to shoot themselves in the foot, we can't do anything about it.

Fabio
_______________________________________________
discuss mailing list
discuss@xxxxxxxxxxxx
http://lists.corosync.org/mailman/listinfo/discuss