Re: [RFC] quorum module configuration bits

"Fabio M. Di Nitto" <fdinitto@xxxxxxxxxx> · Thu, 09 Feb 2012 16:47:43 +0100

Hi Vladislav,

On 1/27/2012 10:46 PM, Vladislav Bogdanov wrote:
> 26.01.2012 15:41, Fabio M. Di Nitto wrote:
>> On 1/26/2012 1:15 PM, Vladislav Bogdanov wrote:
>>
>>>>>> Probably not even lower than the number of votes from nodes which are now
>>>>>> either active or inactive but joined at least once (I suppose that the
>>>>>> nodelist is fully editable at runtime, so the admin may somehow reset the
>>>>>> join count of a node and only then reduce expected_votes).
>>>>
>>>> I have been thinking about this some more, but I am not sure I grasp the
>>>> use case or what kind of protection you try to suggest.
>>>>
>>>> Reducing the number of expected_votes is an admin action, it's not very
>>>> different from removing a node from the "seen" list manually and
>>>> recalculating expected_votes.
>>>>
>>>> Can you clarify it for me?
>>>
>>> Imagine (this case is a little bit hypothetical, but anyways):
>>> * You have a cluster with 8 active nodes, and you (for some historical
>>> reasons or due to admin fault/laziness) have expected_votes set to 3
>>> (ok, you had a 3-node cluster not so long ago, but added more nodes
>>> because of growing load).
>>> * Cluster splits 5+3 due to loss of communication between switches (or
>>> switch-stacks).
>>> * 3 nodes are fenced.
>>> * Partition with majority continues operation.
>>> * 3 fenced nodes boot back, and form *quorate* partition because they
>>> have expected_votes set to 3
>>> * Data is corrupted
>>>
>>> If fenced nodes know right after boot that cluster consists of 8 active
>>> nodes, they would not override expected_votes obtained from the
>>> persistent "seen" list with the lower value from the config, and the
>>> data is safe.
>>
>> Oh great.. yes I see where you are going here. It sounds like an interesting
>> approach but that clearly requires a file in which to store that information.
> 
> I do not see a big problem here...
> Corosync saves its ring persistently anyways.
> 
>>
>> There is still a window where the file containing the expected_votes
>> from the "seen" list is corrupted, though. At that point it's difficult to
>> detect which of the two values is correct, and it doesn't prevent
>> the issue at all if the file is removed entirely (even by accident), but
>> at first glance I would say that it is better than nothing.
> 
> Hopefully at least not all nodes from a fenced partition will have it
> corrupted/deleted. They should honor the maximal ev value from them all.
> 
>>
>> I'll have a test and see how it pans out but at a first glance I think
>> we should only store the last known expected_votes while quorate.
>> The node booting would use the higher of the two values. If the cluster
>> has decreased in size in the meantime, the node joining would be
>> informed about it (just sent a patch to the list about it 10 minutes ago ;))

So I am 99% done with this patch (saving the highest expected_votes and
so on), but there is a corner case I am not entirely sure how to handle.

Let's take an example.

8-node cluster (each node has 1 vote, for simplicity)
expected_votes set to 3

3 nodes are happily running and all...

increase to 8 nodes

new expected_votes is 8 (and we remember this by writing it to a file).

we scale back to 3 nodes at this point.

expected_votes (runtime): 3
higher_ever_seen: 8
quorum: 5

In the simplest scenario, where 3 or 4 nodes boot up in a separate
partition, we are good: the partition would not be quorate.

But in the worst case scenario, where 5 nodes boot up in a separate
partition, those can actually become quorate.

That's clearly bad.
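
To make the arithmetic concrete, here is a rough sketch in Python. It is not
votequorum code: the function names are made up for illustration, and quorum
is assumed to be a simple majority (expected_votes / 2 + 1, rounded down),
which matches the quorum of 5 for expected_votes 8 above.

```python
def quorum(expected_votes: int) -> int:
    """Assumed simple-majority rule: more than half of expected_votes."""
    return expected_votes // 2 + 1


def boot_expected_votes(configured: int, highest_seen: int) -> int:
    """A booting node takes the higher of the config value and the stored one."""
    return max(configured, highest_seen)


def is_quorate(votes_present: int, expected_votes: int) -> bool:
    """A partition is quorate when its votes reach the quorum threshold."""
    return votes_present >= quorum(expected_votes)


# The 3 nodes still running after the scale-back use the runtime value.
running_ev = 3
# Nodes booting in a separate partition only know the config and the file.
boot_ev = boot_expected_votes(configured=3, highest_seen=8)

for booting in (3, 4, 5):
    print(
        f"booting partition of {booting}: quorate={is_quorate(booting, boot_ev)}, "
        f"running partition of 3: quorate={is_quorate(3, running_ev)}"
    )
```

Under these assumptions the 3- and 4-node booting partitions stay inquorate,
while the 5-node one becomes quorate at the same time as the 3 running nodes.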

I can't really think of a way to prevent 2 partitions from going quorate at
this point. I know it's a rather extreme corner case, but it is a case
that can happen (and be sure customers will make it happen ;))

Any suggestions?

Thanks
Fabio

PS this patch + the recently posted leave_remove: option would give you
100% freedom to scale back and forth without even touching ev at
runtime. Just need to solve this case....