Re: [RFC] quorum module configuration bits

26.01.2012 15:41, Fabio M. Di Nitto wrote:
> On 1/26/2012 1:15 PM, Vladislav Bogdanov wrote:
> 
>>>>> Probably not even lower than the number of votes from nodes which are
>>>>> now either active, or inactive but have joined at least once (I suppose
>>>>> the nodelist is fully editable at runtime, so the admin may somehow
>>>>> reset a node's join count and only then reduce expected_votes).
>>>
>>> I have been thinking about this some more, but I am not sure I grasp the
>>> use case or what kind of protection you are trying to suggest.
>>>
>>> Reducing the number of expected_votes is an admin action; it's not very
>>> different from removing a node from the "seen" list manually and
>>> recalculating expected_votes.
>>>
>>> Can you clarify it for me?
>>
>> Imagine (this case is a little bit hypothetical, but anyway):
>> * You have a cluster with 8 active nodes, and (for some historical
>> reasons or due to admin fault/laziness) you have expected_votes set to 3
>> (ok, you had a 3-node cluster not so long ago, but added more nodes
>> because of growing load).
>> * The cluster splits 5+3 due to loss of communication between switches
>> (or switch stacks).
>> * 3 nodes are fenced.
>> * The partition with the majority continues operation.
>> * The 3 fenced nodes boot back up and form a *quorate* partition because
>> they have expected_votes set to 3.
>> * Data is corrupted.
>>
>> If the fenced nodes knew right after boot that the cluster consists of 8
>> active nodes, they would not override the expected_votes obtained from
>> the persistent "seen" list with the lower value from the config, and the
>> data would be safe.
> 
> Oh great... yes, I see where you are going here. It sounds like an
> interesting approach, but that clearly requires a file in which to store
> that information.

I do not see a big problem here...
Corosync saves its ring persistently anyway.
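
To make the scenario above concrete, the dangerous configuration is
something like this (a sketch only; the exact syntax and the addresses
are illustrative, roughly the new votequorum-style config we are
discussing):

quorum {
    provider: corosync_votequorum
    # stale value left over from the old 3-node cluster
    expected_votes: 3
}

nodelist {
    node {
        ring0_addr: 192.168.1.1
        nodeid: 1
    }
    # ... nodes 2 through 8 declared the same way ...
}

Eight nodes are declared (and have joined at some point), but
expected_votes still says 3, so any 3 rebooted nodes can convince
themselves that they are quorate.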

> 
> There is still a window where the file containing the expected_votes
> from the "seen" list is corrupted, though. At that point it's difficult
> to detect which of the two pieces of information is correct, and it
> doesn't prevent the issue at all if the file is removed entirely (even
> by accident), but as a first shot I would say that it is better than
> nothing.

Hopefully at least not all nodes from a fenced partition will have it
corrupted/deleted. They should honor the maximal expected_votes value
among them all.
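
Something along these lines (purely an illustrative sketch, not actual
corosync code; the function and parameter names are made up) is what I
mean by honoring the maximal value:

#include <stddef.h>

/* A rejoining node picks the highest expected_votes it can find, so a
 * single corrupted or deleted copy of the persistent value cannot
 * lower the effective quorum. */
static unsigned int pick_expected_votes(unsigned int config_ev,
                                        unsigned int persisted_ev,
                                        const unsigned int *peer_ev,
                                        size_t n_peers)
{
        unsigned int ev = config_ev;
        size_t i;

        if (persisted_ev > ev)
                ev = persisted_ev;
        for (i = 0; i < n_peers; i++) {
                if (peer_ev[i] > ev)
                        ev = peer_ev[i];
        }
        return ev;
}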

> 
> I'll have a test and see how it pans out, but at first glance I think
> we should only store the last known expected_votes while quorate.
> The node booting would use the higher of the two values. If the cluster
> has decreased in size in the meantime, the joining node would be
> informed about it (I just sent a patch to the list about that 10
> minutes ago ;))

I'd argue that then you do not know who was last known (or ever known)
to be active.
A dynamically handled persistent list is much better from this point of
view. And it resembles what pacemaker does right now, which is probably
the major value for me.

If we agree that pacemaker is to be the major consumer of the new stack
(if I understand Red Hat's plans correctly), then we should preserve the
current behavior as much as possible. Right now pacemaker never adds a
node to the CIB if it was not seen before (regardless of how many nodes
you have in the memberlist in flatiron UDPU). I'm pretty happy with that
as an admin. And I definitely want that to be preserved, because in this
case I can pre-configure the whole cluster at the very beginning and
initially power on only the number of nodes I really need, activating
more nodes dynamically and not bothering about correct quorum
calculations at all (until I make some mistake which really needs admin
intervention).
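
For illustration, a flatiron UDPU member list for such a pre-configured
cluster looks roughly like this (addresses are made up):

totem {
    version: 2
    transport: udpu
    interface {
        ringnumber: 0
        bindnetaddr: 192.168.1.0
        member {
            memberaddr: 192.168.1.1
        }
        member {
            memberaddr: 192.168.1.2
        }
        # ... all eight planned nodes listed up front,
        # including the ones that are still powered off ...
    }
}

All the slots are declared from day one, but only the nodes I actually
power on ever show up in the CIB.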

Again, currently pacemaker will not try to fence nodes which are not
listed in the CIB but are listed in the memberlist. That could of course
be seen as a pacemaker weakness, but I do not really see any non-exotic
case where it matters. On the other hand, with that behavior pacemaker
will never try to fence a not-known-to-be-active node at startup (and
will never fail to do so, freezing the whole cluster!). This is really
useful if you have IPMI as the primary fencing channel (ok, most of us
do have that?), and you have nodes that are not installed in their slots
yet and whose IPMI controllers are not reachable.

I'll try to argue with David here too: if we have a persistent list,
then we never ever need to fence the whole actually-quorate partition;
we simply know how many of us are alive from a *list*. We will not
obtain quorum until the admin tells us so and will not touch any
resources, so the data is safe.
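
In other words (again an illustrative sketch, not real votequorum code),
the quorum decision I am arguing for is simply:

/* A partition only becomes quorate when it holds a majority of the
 * votes recorded in the persistent "seen" list, regardless of any
 * smaller expected_votes a freshly booted node carries in its config. */
static int partition_is_quorate(unsigned int votes_in_partition,
                                unsigned int votes_in_seen_list)
{
        unsigned int quorum = votes_in_seen_list / 2 + 1;

        return votes_in_partition >= quorum;
}

In the 8-node example above the rebooted minority has 3 votes against a
threshold of 5, so it stays inquorate no matter what its local config
says.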
And I really do not like it when the quorate partition disappears the
moment some communication problem occurs and the admin has no time to
stop the minor partition from booting, because every problem needs some
investigation, but modern servers boot quickly, no more than 5 minutes
even with PXE.

So, from my point of view, the safest way to go is to know who *really*
forms the cluster when we re-join it, and to assume the worst-case
scenario when we are adding the first node to an (from that node's point
of view) empty cluster.

Best,
Vladislav
_______________________________________________
discuss mailing list
discuss@xxxxxxxxxxxx
http://lists.corosync.org/mailman/listinfo/discuss


