10.02.2012 09:52, Fabio M. Di Nitto wrote:
> On 2/10/2012 6:17 AM, Vladislav Bogdanov wrote:
>> 10.02.2012 07:10, Fabio M. Di Nitto wrote:
>>> On 02/09/2012 09:07 PM, Vladislav Bogdanov wrote:
>>>> Hi Fabio,
>>>>
>>>> 09.02.2012 18:47, Fabio M. Di Nitto wrote:
>>>>> Hi Vladislav,
>>>>>
>>>>> On 1/27/2012 10:46 PM, Vladislav Bogdanov wrote:
>>>>>> 26.01.2012 15:41, Fabio M. Di Nitto wrote:
>>>>>>> On 1/26/2012 1:15 PM, Vladislav Bogdanov wrote:
>>>>>>>
>>>>>>>>>>> Probably even not lower than number of votes from nodes which are now
>>>>>>>>>>> either active or inactive but joined at least once (I suppose that
>>>>>>>>>>> nodelist is fully editable at runtime, so admin may some-how reset join
>>>>>>>>>>> count of node and only than reduce expected_votes).
>>>>>>>>>
>>>>>>>>> I have been thinking about this some more, but I am not sure I grasp the
>>>>>>>>> use case or what kind of protection you try to suggest.
>>>>>>>>>
>>>>>>>>> Reducing the number of expected_votes is an admin action, it´s not very
>>>>>>>>> different from removing a node from the "seen" list manually and
>>>>>>>>> recalculating expected_votes.
>>>>>>>>>
>>>>>>>>> Can you clarify it for me?
>>>>>>>>
>>>>>>>> Imagine (this case is a little bit hypothetical, but anyways):
>>>>>>>> * You have cluster with 8 active nodes, and you (for some historical
>>>>>>>> reasons or due to admin fault/laziness) have expected_votes set to 3
>>>>>>>> (ok, you had 3-node cluster not so long ago, but added more nodes
>>>>>>>> because of growing load).
>>>>>>>> * Cluster splits 5+3 due to loss of communication between switches (or
>>>>>>>> switch-stacks).
>>>>>>>> * 3 nodes are fenced.
>>>>>>>> * Partition with majority continues operation.
>>>>>>>> * 3 fenced nodes boot back, and form *quorate* partition because they
>>>>>>>> have expected_votes set to 3
>>>>>>>> * Data is corrupted
>>>>>>>>
>>>>>>>> If fenced nodes know right after boot that cluster consists of 8 active
>>>>>>>> nodes, they would not override expected_votes obtained from the
>>>>>>>> persistent "seen" list with the lower value from the config, and the
>>>>>>>> data is safe.
>>>>>>>
>>>>>>> Oh great.. yes I see where you are going here. It sounds an interesting
>>>>>>> approach but that clearly requires a file where to store those information.
>>>>>>
>>>>>> I do not see a big problem here...
>>>>>> Corosync saves its ring persistently anyways.
>>>>>>
>>>>>>>
>>>>>>> There is still a window where the file containing the expected_votes
>>>>>>> from "seen" list is corrupted tho. At that point it´s difficult to
>>>>>>> detect which of the two information is correct and it doesn´t prevent
>>>>>>> the issue at all if the file is removed entirely (even by accident), but
>>>>>>> at a first shot i would say that it is better than nothing.
>>>>>>
>>>>>> Hopefully at least not all nodes from a fenced partition will have it
>>>>>> corrupted/deleted. They should honor the maximal ev value from them all.
>>>>>>
>>>>>>>
>>>>>>> I´ll have a test and see how it pans out but at a first glance I think
>>>>>>> we should only store the last known expected_votes while quorate.
>>>>>>> The node booting would use the higher of the two values. If the cluster
>>>>>>> has decreased in size in the meantime, the node joining would be
>>>>>>> informed about it (just sent a patch to the list about it 10 minutes ago ;))
>>>>>
>>>>> so I am 99% done with this patch, by saving highest expected_votes and
>>>>> so on, but there is a corner case I am not entirely sure how to handle.
>>>>>
>>>>> Let´s take an example.
>>>>>
>>>>> 8 nodes cluster (each node votes 1 for simplicity)
>>>>> expected_votes set to 3
>>>>>
>>>>> 3 nodes are happily running and all...
>>>>>
>>>>> increase to 8 nodes
>>>>>
>>>>> new expected_votes is 8 (and we remember this by writing it on file).
>>>>>
>>>>> we scale back to 3 nodes at this point.
>>>>
>>>> This is a little bit unclean for me.
>>>
>>> Maybe I didn't explain it properly. Let me try again.
>>>
>>>> According to your last work, I suppose you mean that 5 nodes are just
>>>> cleanly shut down, and cluster reduces expected votes and quorum
>>>> accordingly.
>>>>
>>>> I do not have a strong PoV on leave_remove feature yet. On the one hand
>>>> it is handy. On the other it is dangerous at least, and the corner case
>>>> you talk about highlights this. After several hours of brainstorm I do
>>>> not see any clean solution for this case. Except to not allow automatic
>>>> expected votes decrease at all.
>>>
>>> leave_remove requires a perfect clean node shutdown to work. Otherwise
>>> ev is not recalculated. Node starts to shutdown, sends a message to the
>>> other cluster node that it is leaving and the other nodes "downscale",
>>> but how this happen is irrelevant to this problem and dangerous no. It's
>>> something cman had for ages and worked pretty well.
>>
>> I still not convinced and would prefer manual deletion...
>
> Sure but leave_remove is never enabled by default. It´s a user choice,
> like enabling highest_seen_tracking. Nothing says they need be used in
> combo. It just makes my test easier by removing nodes automatically
> instead of doing it manually.
>
>>
>>>
>>> The point was to reproduce your original use case:
>>>
>>> Start with 3, scale up to 17 and then go back to 3.
>>>
>>> Once you are back to 3, highest_ev is 17 (for now I didn't allow hev
>>> downscale/override yet and needs fixing for other use cases).
>>>
>>> The process you used to go back to 3 is irrelevant (either manual or via
>>> leave_remove). With the final result that we want to avoid any of the
>>> shutdown node to gain quorum in a partition.
>>
>> Main goal is to avoid data corruption (prevent from two quorate
>> partitions) I think. Stability is little bit less important here.
>
> Yeps, that´s why we try to implement those barriers. We are on the same
> page regarding goals here.
>
>>
>>>
>>>>
>>>> One possibility just came to mind two seconds ago: what if we just not
>>>> allow expected_votes to go below quorum based on higher_ever_seen?
>>>> That would help a lot, although it introduces
>>>> not-very-clean-for-everyone logic. It is just a raw idea, without any
>>>> logical background. Does it solve the problem? Comments are welcome.
>>>>
>>>> I mean, if you have higher_ever_seen:8, then expected_votes (runtime)
>>>> should not go below 8/2+1=5.
>>>> Of course this will raise handful of reports from users unless it is
>>>> documented IN CAPS with !!!!!!!!dozens of exclamation marks!!!!!!!!
>>>> (more than one time ;) ).
>>>
>>> Hmmm that is an interesting approach yes but it is indeed rather
>>> confusing for the final user.
>>
>> That is what I say above ;)
>>
>>>
>>> "Yes you can start with 3, scale up to N, but you can't go below
>>> quorum(N)..."
>>
>> Unless you do manual intervention.
>
> Clearly. manual intervention is always there but we need to workaround
> what can be done automagically.
>
>>
>>>
>>>>
>>>> So, to set it lower one needs to some-how edit higher_ever_seen.
>>>>
>>>> Frankly speaking, my previous suggestion (keep persistent list of
>>>> cluster members) still valid for this. And I still really like it. Admin
>>>> would just say:
>>>> "Hey, node6 is not longer a part of cluster, please delete it from
>>>> everywhere." Node is removed from cmap and then higher_ever_seen is
>>>> recalculated automatically. Without this admin needs to "calculate" that
>>>> value. And even this simple arithmetic can be error-prone in some
>>>> circumstances (time pressure, sleepless night at work, etc.).
>>>
>>> I still haven't integrated the highest_ever_seen calculation with
>>> nodelist (tho it's easy) or "downgrading" of highest_ever_seen.
>>>
>>> The persistent list doesn't help me at all in this case.
>>> highest_ever_seen can only increase at this point in time, and
>>> eventually it can be downgraded manually (or via nodelist editing).
>>
>> It would help to avoid a mess when node votes are not same for all
>> nodes. I'm sure that I will make mistake when I need to recalculate
>> something in a such heterogeneous cluster manually. But if I just say
>> "ok, node X is not supposed to be active any longer, please delete it
>> from any calculations", then chance for mistake is lower by the order of
>> magnitude.
>
> Ok, let´s try to recap this one second because i see this from the
> votequorum internal calculation/code perspective and you from final user
> point of view (that is good so we can find gaps ;)).

Great.

>
> My understanding is that:
>
> N node cluster, where votes are not even.
>
> At some point in time you shutdown node X (that votes something)
>
> That nodeX is marked "DEAD" in the node list and it stops voting.
> nodeX votes are still used to calculate expected_votes.
>
> Now, you want to tell votequorum that nodeX is gone and recalculate
> expected_votes.
>
> You have two options:
>
> 1) temporary remove the node from the calculation:
>
> corosync-quorumtool -e $value
>
> where value can be anything below current votes (just enter 1 to make
> your life simpler), will pull down expected_votes to current node votes.
>
> Given that expected_votes can never be lower than total_votes in the
> current cluster, votequorum will do the calculation for you correctly.

Although it is not intuitive and relies on some implicit logic that is not
very clean at first sight, I can probably live with that. An alternative,
more intuitive command would be "shrink expected_votes to current
total_votes".
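
To make the implicit rule explicit, here is a minimal sketch of the clamping
as I understand it from the description above. It is not votequorum code; the
function names are hypothetical, and the quorum formula is simply the
N/2+1 arithmetic used elsewhere in this thread.

```python
def quorum(expected_votes: int) -> int:
    """Quorum as used in this thread: expected_votes / 2 + 1 (integer division)."""
    return expected_votes // 2 + 1


def effective_expected_votes(requested: int, total_votes: int) -> int:
    """Hypothetical model of an admin request to change expected_votes.

    Assumption taken from the discussion: expected_votes can never be lower
    than the total votes of the nodes currently in the cluster, so any lower
    request is pulled up to total_votes.
    """
    if requested < 1:
        raise ValueError("expected_votes must be at least 1")
    return max(requested, total_votes)


if __name__ == "__main__":
    # 5 votes are currently active; the admin "just enters 1".
    ev = effective_expected_votes(requested=1, total_votes=5)
    print(ev, quorum(ev))
```

Seen this way, "enter 1" and "shrink expected_votes to current total_votes"
give the same result; the second wording just says what actually happens.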
>
> 2) remove the node forever from the nodelist and votequorum will pick it
> up automatically.
>
> $somecorosync-tool-magic-i-dont-know-the-syntax
>
> same property for expected_votes apply here and it will be recalculated
> for you.

Do you mean dynamic removal of a node from the config file, or just from the
internal in-process list? The former is a no-go, I'd say. The latter brings
us back to the list of "seen" nodes, because otherwise a cluster restart
returns you to the previous state.

>
> highest_seen_votes can then be lowered to current expected_votes in this
> case since it is an admin request to lower or change everything.
>
> Either way, internally, i don´t need to exchange the list of seen nodes
> because either the nodelist from corosync.conf _or_ the calculation
> request will tell me what to do.

I always prefer to have important statements listed explicitly. Implicit
ones always leave a chance of being interpreted incorrectly. Look:

"You have a cluster of max 8 nodes with max 10 votes, and 4 of them with 5
votes are known to be active. I won't say which ones, just trust me."

"You have a cluster of max 8 nodes, and nodes A, B, C, D are active. Nodes
E, F, G, H are not active. A and E have two votes each, all others have one
vote each."

I would always prefer the latter statement.
(This example has nothing to do with the split-brain discussion, it is just
an implicit vs. explicit example.)

>
>>
>>>
>>>>
>>>> The most important point here is that (still) possible split-brain is
>>>> caused not by a software decision but by the admin's action. You
>>>> understand what does it mean for support (and for judges in the worst
>>>> case ;) ).
>>>
>>> Right, we are on the same page here.
>>>
>>> In my example we can protect users against being "stupid" up to
>>> quorum(highest_expected_votes) basically. Can we do better than that?
>>
>> I wouldn't say we can do anything better.
>>
>>>
>>> so if you have
>>> 17 nodes,
>>> hev is 17
>>> quorum(hev) = 9
>>>
>>> The admin can "safely look good" by powering on by mistake up to 8
>>> nodes, but if he fires up 9, then the new quorate partition will fence
>>> the old one running services.
>>
>> Only if you have startup fencing enabled. Otherwise you end up with data
>> corruption again.
>> And, even with startup fencing enabled you'll get fencing war after old
>> partition reboots back.
>>
>> I really doubt we can easily avoid this.
>
> Ok, i guess we are heading to the same conclusions that this is pretty
> much the case of "don´t execute rm -rf /". We can only protect users up
> to a certain point, if they like to shoot themselves we can´t do
> anything about it.

Some documentation with advice would be nice to have. For example:

It is not 100% safe to set expected_votes manually to a value less than
N/2+1, where N is the total number of votes from all possible cluster
members (i.e. less than the highest possible quorum value). If you still
want to do that, you need to guarantee that the cluster never partitions
(e.g. with a redundant ring configuration). Otherwise there is a possibility
of having all your data corrupted.

Ugh! What if we just deny that^ by default for a one-ring config? Possibly
with some magic hard-to-configure parameter (md5/sha of the corosync key
file?) to allow operation even on one ring, since it is possible to have a
finely crafted network setup which provides guarantees even with one ring:
e.g. LACP bonding over a switch stack, where the same bonds are used for
*both* cluster communication and data access, and every node is connected to
at least two different stack members.

I'd also somehow recommend that, even with a redundant ring, the cluster
should never be put into an "undetermined" state by powering off the old
partition, powering on a new one and then powering on the old one again. I
do not know why, but I feel that is dangerous. Maybe my feeling is not valid.
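
To illustrate the documentation advice above, here is a rough sketch of the
check and of the 8-node, expected_votes=3 scenario from earlier in the
thread. All names are hypothetical and nothing here is corosync API. It also
simplifies things: every partition is assumed to judge quorum against the
same configured expected_votes, which is what freshly booted nodes would do.

```python
def quorum(expected_votes: int) -> int:
    """Quorum as used in this thread: expected_votes / 2 + 1 (integer division)."""
    return expected_votes // 2 + 1


def below_advised_minimum(expected_votes: int, all_possible_votes: int) -> bool:
    """True if expected_votes is lower than N/2+1 for N = all possible votes.

    This is only the rule of thumb proposed above, not a safety proof.
    """
    return expected_votes < quorum(all_possible_votes)


def quorate_partitions(expected_votes: int, partition_votes: list[int]) -> list[bool]:
    """For each partition, would it consider itself quorate (simplified model)?"""
    needed = quorum(expected_votes)
    return [votes >= needed for votes in partition_votes]


if __name__ == "__main__":
    # 8 one-vote nodes, expected_votes left at 3, cluster splits 5+3.
    print(below_advised_minimum(3, 8))    # configuration should be flagged
    print(quorate_partitions(3, [5, 3]))  # both sides think they are quorate
    print(quorate_partitions(8, [5, 3]))  # with expected_votes=8 only one side does
```

A warning (or a refusal on one-ring configs) could hang off the first
function, but where exactly such a check belongs is of course up to you.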

Vladislav