Re: [RFC] quorum module configuration bits

10.02.2012 07:10, Fabio M. Di Nitto wrote:
> On 02/09/2012 09:07 PM, Vladislav Bogdanov wrote:
>> Hi Fabio,
>>
>> 09.02.2012 18:47, Fabio M. Di Nitto wrote:
>>> Hi Vladislav,
>>>
>>> On 1/27/2012 10:46 PM, Vladislav Bogdanov wrote:
>>>> 26.01.2012 15:41, Fabio M. Di Nitto wrote:
>>>>> On 1/26/2012 1:15 PM, Vladislav Bogdanov wrote:
>>>>>
>>>>>>>>> Probably even not lower than the number of votes from nodes which are
>>>>>>>>> now either active or inactive but joined at least once (I suppose that
>>>>>>>>> the nodelist is fully editable at runtime, so the admin may somehow
>>>>>>>>> reset the join count of a node and only then reduce expected_votes).
>>>>>>>
>>>>>>> I have been thinking about this some more, but I am not sure I grasp the
>>>>>>> use case or what kind of protection you are trying to suggest.
>>>>>>>
>>>>>>> Reducing the number of expected_votes is an admin action; it's not very
>>>>>>> different from removing a node from the "seen" list manually and
>>>>>>> recalculating expected_votes.
>>>>>>>
>>>>>>> Can you clarify it for me?
>>>>>>
>>>>>> Imagine (this case is a little bit hypothetical, but anyway):
>>>>>> * You have a cluster with 8 active nodes, and (for some historical
>>>>>> reasons or due to admin fault/laziness) expected_votes is set to 3
>>>>>> (ok, you had a 3-node cluster not so long ago, but added more nodes
>>>>>> because of growing load).
>>>>>> * The cluster splits 5+3 due to loss of communication between switches
>>>>>> (or switch stacks).
>>>>>> * The 3 nodes are fenced.
>>>>>> * The partition with the majority continues operation.
>>>>>> * The 3 fenced nodes boot back and form a *quorate* partition because
>>>>>> they have expected_votes set to 3.
>>>>>> * Data is corrupted.
>>>>>>
>>>>>> If the fenced nodes knew right after boot that the cluster consists of 8
>>>>>> active nodes, they would not override the expected_votes obtained from
>>>>>> the persistent "seen" list with the lower value from the config, and the
>>>>>> data would be safe.
>>>>>
>>>>> Oh great.. yes, I see where you are going here. It sounds like an
>>>>> interesting approach, but it clearly requires a file to store that
>>>>> information.
>>>>
>>>> I do not see a big problem here...
>>>> Corosync saves its ring persistently anyway.
>>>>
>>>>>
>>>>> There is still a window where the file containing the expected_votes
>>>>> from the "seen" list gets corrupted, though. At that point it's difficult
>>>>> to detect which of the two pieces of information is correct, and it
>>>>> doesn't prevent the issue at all if the file is removed entirely (even by
>>>>> accident), but as a first shot I would say that it is better than nothing.
>>>>
>>>> Hopefully at least not all nodes from a fenced partition will have it
>>>> corrupted/deleted. They should honor the maximum ev value among them all.
>>>>
>>>>>
>>>>> I'll have a test and see how it pans out, but at first glance I think
>>>>> we should only store the last known expected_votes while quorate.
>>>>> The node booting would use the higher of the two values. If the cluster
>>>>> has decreased in size in the meantime, the joining node would be
>>>>> informed about it (I just sent a patch to the list about it 10 minutes ago ;))
>>>
>>> So I am 99% done with this patch (it saves the highest expected_votes and
>>> so on), but there is a corner case I am not entirely sure how to handle.
>>>
>>> Let's take an example.
>>>
>>> An 8-node cluster (each node votes 1 for simplicity),
>>> expected_votes set to 3.
>>>
>>> 3 nodes are happily running and all...
>>>
>>> We increase to 8 nodes.
>>>
>>> The new expected_votes is 8 (and we remember this by writing it to a file).
>>>
>>> We scale back to 3 nodes at this point.
>>
>> This is a little bit unclear to me.
> 
> Maybe I didn't explain it properly. Let me try again.
> 
>> According to your last work, I suppose you mean that 5 nodes are just
>> cleanly shut down, and the cluster reduces expected votes and quorum
>> accordingly.
>>
>> I do not have a strong PoV on the leave_remove feature yet. On the one
>> hand it is handy. On the other it is dangerous at the least, and the
>> corner case you talk about highlights this. After several hours of
>> brainstorming I do not see any clean solution for this case, except to
>> not allow automatic expected_votes decreases at all.
> 
> leave_remove requires a perfectly clean node shutdown to work; otherwise
> ev is not recalculated. A node starts to shut down, sends a message to the
> other cluster nodes that it is leaving, and the other nodes "downscale".
> But how this happens is irrelevant to this problem, and no, it is not
> dangerous. It's something cman has had for ages and it has worked pretty well.

I'm still not convinced and would prefer manual deletion...
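
Just so we are talking about the same thing, here is how I read the
leave_remove behaviour you describe, as a rough sketch (the struct and
function names are made up for illustration, not votequorum's actual
internals):

#include <stdint.h>

/* Hypothetical view of a departing member as seen by the surviving nodes. */
struct member {
    uint32_t nodeid;
    uint32_t votes;
    int      sent_clean_leave;   /* set only on an orderly shutdown */
};

/*
 * leave_remove-style downscale: when a node announces a clean leave,
 * the survivors drop its votes from expected_votes and recompute quorum.
 * A crashed/fenced node never sends the message, so ev stays unchanged.
 */
static void on_node_leave(uint32_t *expected_votes, uint32_t *quorum,
                          const struct member *m)
{
    if (!m->sent_clean_leave)
        return;                          /* unclean exit: no downscale */

    *expected_votes -= m->votes;
    *quorum = *expected_votes / 2 + 1;
}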

> 
> The point was to reproduce your original use case:
> 
> Start with 3, scale up to 17 and then go back to 3.
> 
> Once you are back to 3, highest_ev is 17 (for now I haven't allowed hev
> downscale/override yet; that needs fixing for other use cases).
> 
> The process you used to go back to 3 is irrelevant (either manual or via
> leave_remove). The end result we want is to avoid any of the shut-down
> nodes gaining quorum in a partition.

The main goal is to avoid data corruption (i.e. prevent two quorate
partitions), I think. Stability is a little bit less important here.
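
To be sure I read your patch direction correctly, the boot-time rule would
be roughly this (just a sketch with an invented function name, not the
actual patch):

#include <stdint.h>

/*
 * On startup, never trust a configured expected_votes that is lower than
 * the highest value recorded while the node was part of a quorate cluster;
 * the stored value wins, so a rebooted minority cannot declare itself quorate.
 */
static uint32_t effective_expected_votes(uint32_t configured_ev,
                                         uint32_t stored_highest_ev)
{
    return (stored_highest_ev > configured_ev) ? stored_highest_ev
                                               : configured_ev;
}

With my original example that is effective_expected_votes(3, 8) == 8, so
the three rebooted nodes would need 5 votes for quorum and cannot get them
on their own.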

> 
>>
>> One possibility just came to mind two seconds ago: what if we simply do
>> not allow expected_votes to go below the quorum derived from
>> higher_ever_seen? That would help a lot, although it introduces
>> not-very-clean-for-everyone logic. It is just a raw idea, without any
>> logical background. Does it solve the problem? Comments are welcome.
>>
>> I mean, if you have higher_ever_seen: 8, then expected_votes (runtime)
>> should not go below 8/2+1=5.
>> Of course this will raise a handful of reports from users unless it is
>> documented IN CAPS with !!!!!!!!dozens of exclamation marks!!!!!!!!
>> (more than once ;) ).
> 
> Hmmm, that is an interesting approach, yes, but it is indeed rather
> confusing for the end user.

That is what I said above ;)
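
In code, the rule I was proposing is just a clamp (again only a sketch to
illustrate the idea, not a patch; the function names are invented):

#include <stdint.h>

/* quorum(n) = n / 2 + 1, as used elsewhere in this thread */
static uint32_t quorum_of(uint32_t votes)
{
    return votes / 2 + 1;
}

/*
 * Refuse to let expected_votes drop below quorum(higher_ever_seen):
 * with higher_ever_seen = 8 the floor is 8 / 2 + 1 = 5.
 */
static uint32_t clamp_expected_votes(uint32_t requested_ev,
                                     uint32_t higher_ever_seen)
{
    uint32_t floor = quorum_of(higher_ever_seen);

    return (requested_ev < floor) ? floor : requested_ev;
}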

> 
> "Yes you can start with 3, scale up to N, but you can't go below
> quorum(N)..."

Unless you do manual intervention.

> 
>>
>> So, to set it lower one needs to somehow edit higher_ever_seen.
>>
>> Frankly speaking, my previous suggestion (keep a persistent list of
>> cluster members) is still valid for this. And I still really like it.
>> The admin would just say:
>> "Hey, node6 is no longer a part of the cluster, please delete it from
>> everywhere." The node is removed from cmap and then higher_ever_seen is
>> recalculated automatically. Without this, the admin needs to "calculate"
>> that value. And even this simple arithmetic can be error-prone in some
>> circumstances (time pressure, a sleepless night at work, etc.).
> 
> I still haven't integrated the highest_ever_seen calculation with the
> nodelist (though it's easy), or the "downgrading" of highest_ever_seen.
> 
> The persistent list doesn't help me at all in this case.
> highest_ever_seen can only increase at this point in time, and
> eventually it can be downgraded manually (or via nodelist editing).

It would help to avoid a mess when node votes are not the same for all
nodes. I'm sure that I would make a mistake if I had to recalculate
something manually in such a heterogeneous cluster. But if I can just say
"ok, node X is not supposed to be active any longer, please drop it from
any calculations", then the chance of a mistake is lower by an order of
magnitude.
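
Something like this is all I am asking for (the nodelist layout here is
invented purely for illustration, not cmap's real representation):

#include <stddef.h>
#include <stdint.h>

struct node_entry {
    uint32_t nodeid;
    uint32_t votes;
    int      removed;   /* set when the admin says "node X is gone" */
};

/*
 * Recompute the expected total from the persistent node list, skipping
 * entries the admin has explicitly removed. With heterogeneous votes this
 * is exactly the sum that is easy to get wrong when done by hand.
 */
static uint32_t recalc_highest_ev(const struct node_entry *nodes, size_t n)
{
    uint32_t total = 0;
    size_t i;

    for (i = 0; i < n; i++)
        if (!nodes[i].removed)
            total += nodes[i].votes;

    return total;
}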

> 
>>
>> The most important point here is that a (still) possible split-brain is
>> caused not by a software decision but by the admin's action. You
>> understand what that means for support (and for judges in the worst
>> case ;) ).
> 
> Right, we are on the same page here.
> 
> In my example we can basically protect users against being "stupid" up to
> quorum(highest_expected_votes). Can we do better than that?

I wouldn't say we can do anything better.

> 
> so if you have
> 17 nodes,
> hev is 17
> quorum(hev) = 9
> 
> The admin can still "safely look good" after powering on up to 8 nodes by
> mistake, but if he fires up 9, then the new quorate partition will fence
> the old one that is running services.

Only if you have startup fencing enabled. Otherwise you end up with data
corruption again.
And even with startup fencing enabled, you'll get a fencing war after the
old partition boots back.

I really doubt we can easily avoid this.
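
Spelling out the arithmetic of your 17-node example as a check (nothing
corosync-specific, just the numbers):

#include <stdint.h>
#include <stdio.h>

static int partition_is_quorate(uint32_t partition_votes, uint32_t highest_ev)
{
    return partition_votes >= highest_ev / 2 + 1;
}

int main(void)
{
    /* hev = 17 -> quorum = 9: 8 mistakenly booted nodes stay inquorate,
     * the 9th tips them over and the fencing race begins. */
    printf("%d\n", partition_is_quorate(8, 17));   /* prints 0 */
    printf("%d\n", partition_is_quorate(9, 17));   /* prints 1 */
    return 0;
}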

Best,
Vladislav
_______________________________________________
discuss mailing list
discuss@xxxxxxxxxxxx
http://lists.corosync.org/mailman/listinfo/discuss


