On 2/10/2012 6:17 AM, Vladislav Bogdanov wrote:
> 10.02.2012 07:10, Fabio M. Di Nitto wrote:
>> On 02/09/2012 09:07 PM, Vladislav Bogdanov wrote:
>>> Hi Fabio,
>>>
>>> 09.02.2012 18:47, Fabio M. Di Nitto wrote:
>>>> Hi Vladislav,
>>>>
>>>> On 1/27/2012 10:46 PM, Vladislav Bogdanov wrote:
>>>>> 26.01.2012 15:41, Fabio M. Di Nitto wrote:
>>>>>> On 1/26/2012 1:15 PM, Vladislav Bogdanov wrote:
>>>>>>
>>>>>>>>>> Probably even not lower than the number of votes from nodes which are now either active or inactive but joined at least once (I suppose that the nodelist is fully editable at runtime, so an admin may somehow reset the join count of a node and only then reduce expected_votes).
>>>>>>>>
>>>>>>>> I have been thinking about this some more, but I am not sure I grasp the use case or what kind of protection you are trying to suggest.
>>>>>>>>
>>>>>>>> Reducing the number of expected_votes is an admin action; it's not very different from removing a node from the "seen" list manually and recalculating expected_votes.
>>>>>>>>
>>>>>>>> Can you clarify it for me?
>>>>>>>
>>>>>>> Imagine (this case is a little bit hypothetical, but anyway):
>>>>>>> * You have a cluster with 8 active nodes, and you (for some historical reason or due to admin fault/laziness) have expected_votes set to 3 (ok, you had a 3-node cluster not so long ago, but added more nodes because of growing load).
>>>>>>> * The cluster splits 5+3 due to loss of communication between switches (or switch stacks).
>>>>>>> * 3 nodes are fenced.
>>>>>>> * The partition with the majority continues operation.
>>>>>>> * The 3 fenced nodes boot back and form a *quorate* partition because they have expected_votes set to 3.
>>>>>>> * Data is corrupted.
>>>>>>>
>>>>>>> If the fenced nodes knew right after boot that the cluster consists of 8 active nodes, they would not override the expected_votes obtained from the persistent "seen" list with the lower value from the config, and the data would be safe.
>>>>>>
>>>>>> Oh great, yes I see where you are going here. It sounds like an interesting approach, but that clearly requires a file in which to store that information.
>>>>>
>>>>> I do not see a big problem here...
>>>>> Corosync saves its ring persistently anyway.
>>>>>
>>>>>>
>>>>>> There is still a window where the file containing the expected_votes from the "seen" list is corrupted, though. At that point it's difficult to detect which of the two values is correct, and it doesn't prevent the issue at all if the file is removed entirely (even by accident), but as a first shot I would say that it is better than nothing.
>>>>>
>>>>> Hopefully at least not all nodes from a fenced partition will have it corrupted/deleted. They should honor the maximum ev value among them all.
>>>>>
>>>>>>
>>>>>> I'll have a test and see how it pans out, but at a first glance I think we should only store the last known expected_votes while quorate. The node booting would use the higher of the two values. If the cluster has decreased in size in the meantime, the node joining would be informed about it (just sent a patch to the list about it 10 minutes ago ;))
>>>>
>>>> So I am 99% done with this patch, saving the highest expected_votes and so on, but there is a corner case I am not entirely sure how to handle.
>>>>
>>>> Let's take an example.
>>>>
>>>> 8-node cluster (each node votes 1 for simplicity)
>>>> expected_votes set to 3
>>>>
>>>> 3 nodes are happily running and all...
>>>>
>>>> increase to 8 nodes
>>>>
>>>> new expected_votes is 8 (and we remember this by writing it to a file).
>>>>
>>>> we scale back to 3 nodes at this point.
>>>
>>> This is a little bit unclear to me.
>>
>> Maybe I didn't explain it properly. Let me try again.
>>
>>> According to your latest work, I suppose you mean that 5 nodes are just cleanly shut down, and the cluster reduces expected votes and quorum accordingly.
>>>
>>> I do not have a strong PoV on the leave_remove feature yet. On the one hand it is handy. On the other, at the very least it is dangerous, and the corner case you talk about highlights this. After several hours of brainstorming I do not see any clean solution for this case, except to not allow automatic expected votes decrease at all.
>>
>> leave_remove requires a perfectly clean node shutdown to work; otherwise ev is not recalculated. A node starts to shut down, sends a message to the other cluster nodes that it is leaving, and the other nodes "downscale". How this happens is irrelevant to this problem, and dangerous it is not: it's something cman has had for ages and it worked pretty well.
>
> I am still not convinced and would prefer manual deletion...

Sure, but leave_remove is never enabled by default. It's a user choice, like enabling highest_seen_tracking. Nothing says they need to be used in combo. It just makes my test easier by removing nodes automatically instead of doing it manually.

>
>>
>> The point was to reproduce your original use case:
>>
>> Start with 3, scale up to 17 and then go back to 3.
>>
>> Once you are back to 3, highest_ev is 17 (for now I haven't allowed hev downscale/override yet, and that needs fixing for other use cases).
>>
>> The process you used to go back to 3 is irrelevant (either manual or via leave_remove). The final result we want is to avoid any of the shut-down nodes gaining quorum in a partition.
>
> The main goal is to avoid data corruption (prevent two quorate partitions), I think. Stability is a little bit less important here.

Yep, that's why we try to implement those barriers. We are on the same page regarding goals here.

>
>>
>>>
>>> One possibility just came to mind two seconds ago: what if we just do not allow expected_votes to go below the quorum based on higher_ever_seen? That would help a lot, although it introduces logic that is not very clean for everyone. It is just a raw idea, without any logical background. Does it solve the problem? Comments are welcome.
>>>
>>> I mean, if you have higher_ever_seen:8, then expected_votes (runtime) should not go below 8/2+1=5.
>>> Of course this will raise a handful of reports from users unless it is documented IN CAPS with !!!!!!!!dozens of exclamation marks!!!!!!!! (more than one time ;) ).
>>
>> Hmmm, that is an interesting approach, yes, but it is indeed rather confusing for the end user.
>
> That is what I said above ;)
>
>>
>> "Yes you can start with 3, scale up to N, but you can't go below quorum(N)..."
>
> Unless you do manual intervention.

Clearly. Manual intervention is always there, but we need to work out what can be done automagically.

>
>>
>>>
>>> So, to set it lower one needs to somehow edit higher_ever_seen.
>>>
>>> Frankly speaking, my previous suggestion (keep a persistent list of cluster members) is still valid for this. And I still really like it.
>>> The admin would just say: "Hey, node6 is no longer a part of the cluster, please delete it from everywhere." The node is removed from cmap and then higher_ever_seen is recalculated automatically. Without this, the admin needs to "calculate" that value, and even this simple arithmetic can be error-prone in some circumstances (time pressure, a sleepless night at work, etc.).
>>
>> I still haven't integrated the highest_ever_seen calculation with the nodelist (though it's easy) or the "downgrading" of highest_ever_seen.
>>
>> The persistent list doesn't help me at all in this case. highest_ever_seen can only increase at this point in time, and eventually it can be downgraded manually (or via nodelist editing).
>
> It would help to avoid a mess when node votes are not the same for all nodes. I'm sure that I will make a mistake when I need to recalculate something in such a heterogeneous cluster manually. But if I can just say "ok, node X is not supposed to be active any longer, please delete it from any calculations", then the chance of a mistake is lower by an order of magnitude.

Ok, let's try to recap this for one second, because I see this from the votequorum internal calculation/code perspective and you from the end-user point of view (that is good, so we can find gaps ;)).

My understanding is that:

You have an N-node cluster, where votes are not even.
At some point in time you shut down node X (which votes something).
That nodeX is marked "DEAD" in the node list and it stops voting.
nodeX's votes are still used to calculate expected_votes.

Now you want to tell votequorum that nodeX is gone and recalculate expected_votes.

You have two options:

1) temporarily remove the node from the calculation:

corosync-quorumtool -e $value

where $value can be anything below the current votes (just enter 1 to make your life simpler); this will pull expected_votes down to the current node votes. Given that expected_votes can never be lower than total_votes in the current cluster, votequorum will do the calculation for you correctly.

2) remove the node forever from the nodelist and votequorum will pick it up automatically:

$somecorosync-tool-magic-i-dont-know-the-syntax

The same property of expected_votes applies here and it will be recalculated for you.

highest_seen_votes can then be lowered to the current expected_votes in this case, since it is an admin request to lower or change everything.

Either way, internally, I don't need to exchange the list of seen nodes, because either the nodelist from corosync.conf _or_ the calculation request will tell me what to do.

>
>>
>>>
>>> The most important point here is that a (still) possible split-brain is caused not by a software decision but by the admin's action. You understand what that means for support (and for judges in the worst case ;) ).
>>
>> Right, we are on the same page here.
>>
>> In my example we can protect users against being "stupid" up to quorum(highest_expected_votes), basically. Can we do better than that?
>
> I wouldn't say we can do anything better.
>
>>
>> So if you have
>> 17 nodes,
>> hev is 17
>> quorum(hev) = 9
>>
>> The admin can "safely look good" by powering on up to 8 nodes by mistake, but if he fires up 9, then the new quorate partition will fence the old one that is running services.
>
> Only if you have startup fencing enabled. Otherwise you end up with data corruption again.
> And even with startup fencing enabled you'll get a fencing war after the old partition reboots back.
>
> I really doubt we can easily avoid this.
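
Going back to the two options I recapped above, here is a rough sketch of what the admin would actually type (node6 is the node from your example; the nodeid, vote value and config layout are made up for illustration, and I am leaving out the runtime cmap tool since I don't remember its syntax):

Option 1, temporarily pull expected_votes down at runtime:

    # votequorum clamps expected_votes to the total votes of the nodes
    # currently in the cluster, so "1" is just a convenient lower bound
    corosync-quorumtool -e 1

Option 2, remove the node for good by deleting its entry from the nodelist in corosync.conf (assuming a nodelist that looks roughly like this):

    nodelist {
        node {
            ring0_addr: node6    # delete this whole node {} block
            nodeid: 6
            quorum_votes: 2
        }
        # ... the remaining node {} entries stay as they are ...
    }

In both cases expected_votes is recalculated for you, quorum follows as expected_votes/2+1, and with the change we are discussing highest_ever_seen could be lowered at the same time, since it is an explicit admin request.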
Ok, I guess we are heading to the same conclusion: this is pretty much the "don't execute rm -rf /" case. We can only protect users up to a certain point; if they want to shoot themselves in the foot, we can't do anything about it.

Fabio
_______________________________________________
discuss mailing list
discuss@xxxxxxxxxxxx
http://lists.corosync.org/mailman/listinfo/discuss