Re: [RFC] quorum module configuration bits

10.02.2012 11:32, Fabio M. Di Nitto wrote:
> On 2/10/2012 8:49 AM, Vladislav Bogdanov wrote:
>>>
>>>>
>>>>>
>>>>>>
>>>>>> So, to set it lower one needs to somehow edit highest_ever_seen.
>>>>>>
>>>>>> Frankly speaking, my previous suggestion (keep a persistent list of
>>>>>> cluster members) is still valid for this. And I still really like it. The
>>>>>> admin would just say:
>>>>>> "Hey, node6 is no longer a part of the cluster, please delete it from
>>>>>> everywhere." The node is removed from cmap and then highest_ever_seen is
>>>>>> recalculated automatically. Without this the admin needs to "calculate"
>>>>>> that value. And even this simple arithmetic can be error-prone in some
>>>>>> circumstances (time pressure, a sleepless night at work, etc.).
>>>>>
>>>>> I still haven't integrated the highest_ever_seen calculation with
>>>>> nodelist (tho it's easy) or "downgrading" of highest_ever_seen.
>>>>>
>>>>> The persistent list doesn't help me at all in this case.
>>>>> highest_ever_seen can only increase at this point in time, and
>>>>> eventually it can be downgraded manually (or via nodelist editing).
>>>>
>>>> It would help to avoid a mess when node votes are not the same for all
>>>> nodes. I'm sure that I will make a mistake when I need to recalculate
>>>> something in such a heterogeneous cluster manually. But if I just say
>>>> "ok, node X is not supposed to be active any longer, please delete it
>>>> from any calculations", then the chance of a mistake is lower by an
>>>> order of magnitude.
>>>
>>> Ok, let's try to recap this for one second, because I see this from the
>>> votequorum internal calculation/code perspective and you from the final
>>> user point of view (that is good, so we can find gaps ;)).
>>
>> Great.
>>
>>>
>>> My understanding is that:
>>>
>>> N node cluster, where votes are not even.
>>>
>>> At some point in time you shut down node X (which contributes some votes).
>>>
>>> nodeX is marked "DEAD" in the node list and it stops voting.
>>> nodeX's votes are still used to calculate expected_votes.
>>>
>>> Now, you want to tell votequorum that nodeX is gone and recalculate
>>> expected_votes.
>>>
>>> You have two options:
>>>
>>> 1) temporarily remove the node from the calculation:
>>>
>>> corosync-quorumtool -e $value
>>>
>>> where $value can be anything below the current votes (just enter 1 to make
>>> your life simpler); this will pull expected_votes down to the current total
>>> of node votes.
>>>
>>> Given that expected_votes can never be lower than total_votes in the
>>> current cluster, votequorum will do the calculation for you correctly.
>>
>> Although it is not intuitive and relies on logic that is not obvious at
>> first, I can probably live with that.
> 
> Why isn't it intuitive? It's actually rather logical in my head.
> 
> You can't have expected_votes below the current cluster votes. There is no
> use case or useful meaning for expected_votes < total votes.
> 
> Note that I am not talking about configuration or configuration override
> here. I am talking about runtime.
> 
>>
>> A more intuitive alternative command would be "shrink expected_votes to
>> current total_votes".
> 
> That can be done in terms of the user interface or some better
> documentation; it doesn't require internal changes.

Agree.
From the user's point of view I'd prefer to have a dedicated, documented
option which just results in "-e 1" internally.
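
As a rough illustration of that workflow (a sketch: the exact status output of
corosync-quorumtool varies between versions, and the wrapper function below is
purely hypothetical, not an existing option):

  # show the current quorum state, including expected and total votes
  corosync-quorumtool -s

  # request an impossibly low value; votequorum clamps expected_votes
  # to the current total_votes, which is exactly the "shrink" we want
  corosync-quorumtool -e 1

  # hypothetical shell wrapper for the dedicated option discussed above
  shrink_expected_votes() {
      corosync-quorumtool -e 1
  }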

> 
>>> 2) remove the node forever from the nodelist and votequorum will pick it
>>> up automatically.
>>>
>>> $somecorosync-tool-magic-i-dont-know-the-syntax
>>>
>>> the same property of expected_votes applies here, and it will be
>>> recalculated for you.
>>
>> Do you mean dynamic removal of the node from the config file, or just from
>> the internal in-process list?
>> The former is a no-go I'd say; the latter brings us back to the list of
>> "seen" nodes, otherwise a cluster restart returns you to the previous state.
> 
> None of the tools touch corosync.conf directly and probably never
> will. A node removal would probably take two admin steps: update
> corosync.conf to drop it from the node list, then use the magic tool to
> remove it at runtime.
> 
> Well no, if you start a cluster with the wrong corosync.conf it's yet
> again a bad user error, and a seen list doesn't protect you 100% anyway.
> Given enough wrong parameters in corosync.conf there is no magic that
> quorum can do to protect you. Not even a seen list. Otherwise you have
> no way to know which of the two you can trust at startup.
> 
> Are we using a wrong corosync.conf, or is our seen list old/obsolete?
> 
> The _only_ protection you have here is if a node is starting with bad
> parameters and it is joining an already quorate partition. The data
> coming from the quorate partition will override the local data.
>
> But if the node is starting in a non-quorate partition (probably even after
> a full cluster restart), you can't say for sure who to trust.
> 
>>>
>>> highest_ever_seen can then be lowered to the current expected_votes in this
>>> case, since it is an admin request to lower or change everything.
>>>
>>> Either way, internally, I don't need to exchange the list of seen nodes,
>>> because either the nodelist from corosync.conf _or_ the calculation
>>> request will tell me what to do.
>>
>> For me it is always preferable to have important statements listed
>> explicitly. Implicit ones always leave a chance of being interpreted
>> incorrectly.
>>
>> Look:
>> "You have cluster of max 8 nodes with max 10 votes, and 4 of them with 5
>> votes are known to be active. I wont say which ones, just trust me."
>>
>> "You have cluster of max 8 nodes, and nodes A, B, C, D are active. Nodes
>> E, F, G, H are not active. A and E has two votes each, all others have
>> one vote each."
>>
>> I would always prefer the latter statement.
>> (This example has nothing to do with the split-brain discussion; it is just
>> an implicit vs. explicit example.)
>>
> 
> you have that info from the nodelist already, especially if you like
> everything explicit.

But how could I distinguish a node that is known to be active but currently
shut down from one that is not active at all?
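
For reference, the explicit form discussed here would look roughly like this in
corosync.conf (a sketch only, assuming the corosync 2.x votequorum nodelist
syntax; node names, addresses and ids are made up):

  nodelist {
      node {
          ring0_addr: nodeA
          nodeid: 1
          quorum_votes: 2
      }
      node {
          ring0_addr: nodeB
          nodeid: 2
          quorum_votes: 1
      }
      # ... nodeC and nodeD with one vote each ...
  }

  quorum {
      provider: corosync_votequorum
  }

Deleting a node's stanza from that list (together with the runtime removal step
mentioned above) drops its votes from the expected_votes calculation, but the
static nodelist by itself does not say whether a listed node is merely shut
down at the moment or gone for good.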

> 
>>>>> The admin can "safely look good" by mistakenly powering on up to 8
>>>>> nodes, but if he fires up 9, then the new quorate partition will fence
>>>>> the old one that is running the services.
>>>>
>>>> Only if you have startup fencing enabled. Otherwise you end up with data
>>>> corruption again.
>>>> And even with startup fencing enabled, you'll get a fencing war after the
>>>> old partition boots back up.
>>>>
>>>> I really doubt we can easily avoid this.
>>>
>>> Ok, I guess we are heading to the same conclusion: this is pretty
>>> much the "don't execute rm -rf /" case. We can only protect users up
>>> to a certain point; if they want to shoot themselves we can't do
>>> anything about it.
>>
>> Some bits of documentation with advice would be nice to have.
>> For example:
>> It is not 100% safe to set expected_votes manually to a value less
>> than N/2+1, where N is the total number of votes from all possible cluster
>> members (the highest possible quorum value). If you still want to do that,
>> then you need to guarantee that the cluster never partitions (e.g. with a
>> redundant ring configuration). Otherwise there is a possibility that all
>> your data gets corrupted.
> 
> Man page updates are a no-brainer once we define the behavior :) Most
> likely some of those checks can be implemented in the user tools too
> (they just need to be done based on the enabled features and so on).
> 
>> Ugh! What if we just deny that^ by default for a one-ring config? Possibly
>> with some magic hard-to-configure parameter (md5/sha of the corosync key
>> file?) to allow the operation even on one ring (as it is possible to have a
>> finely crafted network setup which provides guarantees even with one ring
>> - e.g. LACP bonding over a switch stack, where the same bonds are used for
>> *both* cluster communication and data access, and every node is
>> connected to at least two different stack members).
> 
> Hmmmm, I am not entirely convinced about this. I would like to keep the
> quorum feature set independent from the ring configuration because, as you
> say, it also depends on the hw setup and such. I feel this lands more
> in the documentation area.

That would just be a balance between safety and ease of setup. You
should agree that very often the manual is consulted only when something
goes wrong. Such a feature would just help to prevent the "too late to
RTFM" state.

> 
>>
>> I'd also somehow recommend that, even with a redundant ring, the cluster
>> should never be put into an "undetermined" state by powering off the old
>> partition, powering on the new one and then powering the old one on again.
>> I do not know why, but that feels dangerous to me. Maybe my feeling is not
>> valid.
> 
> (I'll reply to this in the other email you sent)
> 
> Fabio

Best,
Vladislav
_______________________________________________
discuss mailing list
discuss@xxxxxxxxxxxx
http://lists.corosync.org/mailman/listinfo/discuss


