Re: [RFC] quorum module configuration bits

On 2/13/2012 9:26 AM, Vladislav Bogdanov wrote:
> 10.02.2012 11:55, Fabio M. Di Nitto wrote:
>> On 2/10/2012 9:14 AM, Vladislav Bogdanov wrote:
>>> [snip for readability just to highlight one idea]
>>
>> wfm ;)
>>
>>>>>
>>>>> Either way, internally, I don't need to exchange the list of seen nodes
>>>>> because either the nodelist from corosync.conf _or_ the calculation
>>>>> request will tell me what to do.
>>>>
>>>> For me it is always preferable to have important statements listed
>>>> explicitly. Implicit ones always leave a chance of being interpreted
>>>> incorrectly.
>>>>
>>>> Look:
>>>> "You have cluster of max 8 nodes with max 10 votes, and 4 of them with 5
>>>> votes are known to be active. I wont say which ones, just trust me."
>>>>
>>>> "You have cluster of max 8 nodes, and nodes A, B, C, D are active. Nodes
>>>> E, F, G, H are not active. A and E has two votes each, all others have
>>>> one vote each."
>>>>
>>>> I would always prefer the latter statement.
>>>> (This example has nothing to do with the split-brain discussion; it is
>>>> just an implicit vs. explicit example.)
>>>>
>>> [snip]
>>>>
>>>> I'd also recommend that, even with a redundant ring, a cluster should
>>>> never be put into an "undetermined" state by powering off the old
>>>> partition, powering on the new one and then powering on the old one
>>>> again. I do not know why, but that feels dangerous to me. Maybe my
>>>> feeling is not valid.
>>>
>>> Just so we are synchronized.
>>>
>>> Taking the example above:
>>> You have ABCD running, 4 nodes, 5 votes. expected_votes is 5,
>>> highest_ever_seen is 5.
>>
>> correct.
>>
>>> You shut down ABCD and then power on EFGH. The cluster runs with 4
>>> nodes, 5 votes. expected_votes is 5, highest_ever_seen is 5.
>>
>> If the shutdown and power on are done in two distinct stages (first a
>> complete shutdown and then the power on), then yes, that's correct.
> 
> Yes, I meant that.
> 
>>
>>> You power on A.
>>>
>>> What would be the correct final expected_votes value?
>>
>> It only depends on what A votes (you don't say in the above example ;))
> 
> "A and E has two votes each"
> 
>>
>> If A votes 1, then you get expected_votes: 6, highest_ever_seen: 6.
>> If A votes 2, then you get 7/7 (to state the obvious).
>>
>>> It would be 7 with your approach and 10 with the "seen" list
>>
>> ABCD have never "seen" EFGH before, but now EFGH can see A. So it's
>> either 6 or 7 (based on A's votes and the current implementation).
> 
> I understand your point. I just wanted to know your opinion.
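
(To make that arithmetic concrete, here is a toy model of the rule
above -- a sketch only, not the actual votequorum code; the vote counts
are hardcoded from the example:)

#include <stdio.h>

int main(void)
{
    int ev = 5, hes = 5;  /* EFGH running: 4 nodes, 5 votes (E has 2) */
    int a_votes = 2;      /* A has 2 votes in the example; use 1 for 6/6 */

    int total_votes = 5 + a_votes;  /* votes visible once A joins */
    if (total_votes > ev)
        ev = total_votes;
    if (ev > hes)
        hes = ev;

    printf("ev=%d hes=%d\n", ev, hes);  /* prints ev=7 hes=7 */
    return 0;
}
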
> 
>>
>> But there is still an issue with the seen list when you move a bit away
>> from this example.
>>
>> 10 nodes (all votes 1)
>>
>> ABCDEFGHJK
>>
>> ABCDEF running.
>> ev: 6 hes: 6
>>
>> shutdown ABCDEF
>> (dunno why you would do that, but customers and users do the strangest
>> things)
> 
> ;)
> 
>>
>> poweron GHJK
>> ev: 4 hes: 4
>>
>> poweron A
>> ev: 10 hes: 10 -> total_votes in the cluster (5) < quorum (6) -> KABOOM?
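
(Spelling that failure out as a toy model -- again a sketch rather than
corosync code, assuming one vote per node and a GCC-style
__builtin_popcount; nodes A..K map to bits 0..9 of a mask:)

#include <stdio.h>

int main(void)
{
    unsigned a_seen    = 0x03F;               /* A remembers ABCDEF */
    unsigned ghjk_seen = 0x3C0;               /* GHJK remember GHJK */

    unsigned merged = a_seen | ghjk_seen;     /* union: all 10 nodes */
    int ev     = __builtin_popcount(merged);  /* 10 */
    int quorum = ev / 2 + 1;                  /* 6 */
    int alive  = 5;                           /* only A,G,H,J,K are up */

    printf("ev=%d quorum=%d alive=%d -> %s\n", ev, quorum, alive,
           alive >= quorum ? "quorate" : "inquorate (KABOOM)");
    return 0;
}
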
> 
> Not really, I'd expect that. And that was a major reason for me to ask
> "what is the right behavior".
> 
> My idea was that ev and quorum are modified according to the new member's
> point of view. So, if A knows about BCDEF, then the whole cluster should
> know about them unless A's persistent data is cleaned manually (?).
> 
> (GHJK enter)
> G: Hello guys HJK, we are four here, and three of us are enough to make
> decisions.
> HJK: ack
> (GHJK are doing something)
> (A enters)
> A: Oh no, please wait, I know that we also have BCDEF somewhere here,
> so please postpone any actions until they arrive, because they may have
> a different view on what to do. This way you still have a chance not to
> break something valuable!
> (your scenario)
> GHJK: nope
> (my scenario)
> GHJK: ack
> 
> Anyway, this is just about deciding what is safer: throwing previous
> membership information away, or using the biggest known set of members.
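
(The two policies side by side, as hypothetical helper functions --
names invented for this discussion, one vote per node, not actual
votequorum code:)

#include <stdio.h>

/* "Throw away": ev can only be pushed up by the votes of the live
 * membership, tracked via highest_ever_seen. "Biggest known set": ev
 * comes from the union of every node's remembered members. */
static int ev_throw_away(int hes, int live_votes)
{
    return live_votes > hes ? live_votes : hes;
}

static int ev_biggest_set(unsigned merged_seen_mask)
{
    return __builtin_popcount(merged_seen_mask);
}

int main(void)
{
    /* A (remembers ABCDEF) joins GHJK (whose hes was 4): */
    printf("throw away:  ev=%d\n", ev_throw_away(4, 5));   /* 5  */
    printf("biggest set: ev=%d\n", ev_biggest_set(0x3FF)); /* 10 */
    return 0;
}
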
> 
> And I do not know which scenario is actually better (or just
> "expected") when it comes to major upper-layer consumers (e.g.
> pacemaker, dlm). For example, I do not know what the node list in
> pacemaker's CIB would look like after such a scenario finishes. For me
> it would be great if both the quorum engine and pacemaker had a
> consensus on "whom do we know here".
> 
> Maybe Andrew and David can comment (I added you guys to CC)?

Well, the thing is that I can't send a "seen" list back to those layers.
The membership information has to be consistent across the stack, and it
comes from totem; quorum does not decide that. Even if I had a "seen"
list, it would be very tricky to pass it back to the upper layers, and
doing so would create an inconsistent view of the membership (bad).

The "seen" list has to remain internal to quorum. Upper layers only care
about:

- nodes that are part of the current membership
- if a transition occurs, which nodes have joined and/or left
- quorate status (0|1).

How we get to the quorate status is irrelevant to the upper layers.
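
To illustrate the shape of that boundary, here is a hypothetical
notification callback -- a sketch of the idea only, not the literal
libquorum prototype (see quorum.h for the real one):

#include <stdint.h>
#include <stdio.h>

/* All an upper layer would see: the quorate bit plus the membership
 * that totem reported. No "seen" list crosses this boundary. */
static void on_quorum_change(uint32_t quorate,
                             uint32_t member_count,
                             const uint32_t *member_list)
{
    printf("quorate=%u, %u members:", quorate, member_count);
    for (uint32_t i = 0; i < member_count; i++)
        printf(" %u", member_list[i]);
    printf("\n");
}

int main(void)
{
    /* e.g. A,G,H,J,K up with ev 10 -> not quorate */
    uint32_t members[] = { 1, 7, 8, 9, 10 };
    on_quorum_change(0, 5, members);
    return 0;
}
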

> 
>>
>>> (assuming we do
>>> not have leave_remove active; otherwise it may vary from 7 to 10,
>>> depending on the order in which ABCD left the cluster).
>>
>> Let's put leave_remove aside for now; it does not affect
>> highest_ever_seen as it is now, and that integration bit is still
>> missing even from my head. Let's see if we can settle on a correct hes
>> handling; then we can take a look at integrating with other features.
>>
>>> But which of them is the correct one?
>>
>> I guess it's up to us to define what is correct.
>>
>> So far "seen" for me means that a certain node has seen another node
>> live at least once (after that I can track the state).
> 
> I'd say "seen" means that node knew some other node to be an active
> cluster member last time that first node was active.
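
(Pinning down the difference as data -- hypothetical structs, nothing
more:)

#include <stdint.h>

/* Reading 1 (mine): a sticky bit -- X has seen Y live at least once,
 * and the state is tracked from then on. */
struct seen_sticky {
    uint32_t ever_seen_mask;       /* only ever gains bits */
};

/* Reading 2 (Vladislav's): the set of active members X knew the last
 * time X itself was active -- overwritten, not accumulated. */
struct seen_last_active {
    uint32_t members_at_last_run;
};
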

Noted.. let's see what others think about all of the above. At this point
I am going to stage the patch while we agree on the correct behavior (if
it is implementable at all in votequorum). It might not make 2.0, but
possibly one of the updates.

Fabio

