Hi Fabio,

09.02.2012 18:47, Fabio M. Di Nitto wrote:
> Hi Vladislav,
>
> On 1/27/2012 10:46 PM, Vladislav Bogdanov wrote:
>> 26.01.2012 15:41, Fabio M. Di Nitto wrote:
>>> On 1/26/2012 1:15 PM, Vladislav Bogdanov wrote:
>>>
>>>>>>> Probably even not lower than the number of votes from nodes which are now
>>>>>>> either active or inactive but joined at least once (I suppose that the
>>>>>>> nodelist is fully editable at runtime, so the admin may somehow reset a
>>>>>>> node's join count and only then reduce expected_votes).
>>>>>
>>>>> I have been thinking about this some more, but I am not sure I grasp the
>>>>> use case or what kind of protection you try to suggest.
>>>>>
>>>>> Reducing the number of expected_votes is an admin action, it's not very
>>>>> different from removing a node from the "seen" list manually and
>>>>> recalculating expected_votes.
>>>>>
>>>>> Can you clarify it for me?
>>>>
>>>> Imagine (this case is a little bit hypothetical, but anyway):
>>>> * You have a cluster with 8 active nodes, and you (for some historical
>>>> reason or due to admin fault/laziness) have expected_votes set to 3
>>>> (ok, you had a 3-node cluster not so long ago, but added more nodes
>>>> because of growing load).
>>>> * The cluster splits 5+3 due to loss of communication between switches (or
>>>> switch-stacks).
>>>> * 3 nodes are fenced.
>>>> * The partition with majority continues operation.
>>>> * The 3 fenced nodes boot back, and form a *quorate* partition because they
>>>> have expected_votes set to 3.
>>>> * Data is corrupted.
>>>>
>>>> If the fenced nodes know right after boot that the cluster consists of 8 active
>>>> nodes, they would not override expected_votes obtained from the
>>>> persistent "seen" list with the lower value from the config, and the
>>>> data is safe.
>>>
>>> Oh great.. yes I see where you are going here. It sounds like an interesting
>>> approach, but it clearly requires a file in which to store that information.
>>
>> I do not see a big problem here...
>> Corosync saves its ring persistently anyway.
>>
>>>
>>> There is still a window where the file containing the expected_votes
>>> from the "seen" list gets corrupted, though. At that point it's difficult to
>>> detect which of the two values is correct, and it doesn't prevent
>>> the issue at all if the file is removed entirely (even by accident), but
>>> as a first shot I would say that it is better than nothing.
>>
>> Hopefully at least not all nodes from a fenced partition will have it
>> corrupted/deleted. They should honor the maximal ev value among them all.
>>
>>>
>>> I'll have a test and see how it pans out, but at first glance I think
>>> we should only store the last known expected_votes while quorate.
>>> The node booting would use the higher of the two values. If the cluster
>>> has decreased in size in the meantime, the node joining would be
>>> informed about it (just sent a patch to the list about it 10 minutes ago ;))
>
> so I am 99% done with this patch, saving the highest expected_votes and
> so on, but there is a corner case I am not entirely sure how to handle.
>
> Let's take an example.
>
> 8-node cluster (each node votes 1 for simplicity)
> expected_votes set to 3
>
> 3 nodes are happily running and all...
>
> increase to 8 nodes
>
> new expected_votes is 8 (and we remember this by writing it to file).
>
> we scale back to 3 nodes at this point.

This is a little bit unclear to me. Judging from your recent work, I suppose
you mean that the 5 extra nodes are just cleanly shut down, and the cluster
reduces expected_votes and quorum accordingly.
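If I read the scale-back right, the arithmetic would be roughly this (a
throwaway Python sketch with the numbers from your example, not the actual
votequorum code; I am assuming leave_remove simply subtracts a cleanly
leaving node's votes from expected_votes):

def quorum(expected_votes):
    # simple majority of the expected votes
    return expected_votes // 2 + 1

expected_votes = 8       # after growing to 8 nodes, 1 vote each
highest_ever_seen = 8    # what the persistent file would remember

for _ in range(5):       # 5 nodes shut down cleanly, one by one
    expected_votes -= 1  # assumed leave_remove-style automatic reduction

print(expected_votes, quorum(expected_votes))        # 3 2  (runtime view)
print(highest_ever_seen, quorum(highest_ever_seen))  # 8 5  (persistent view)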
I do not have a strong PoV on the leave_remove feature yet. On the one hand
it is handy; on the other it is dangerous, and the corner case you describe
(quoted below) highlights this. After several hours of brainstorming I do not
see any clean solution for this case, except not allowing automatic
expected_votes decrease at all.

One possibility just came to mind: what if we simply do not allow
expected_votes to go below the quorum derived from higher_ever_seen? That
would help a lot, although it introduces logic that will not be obvious to
everyone. It is just a raw idea, without much analysis behind it. Does it
solve the problem? Comments are welcome. I mean, if you have
higher_ever_seen: 8, then expected_votes (runtime) should not go below
8/2+1 = 5 (there is a rough sketch of this arithmetic at the very end of
this mail, below the quoted text).

Of course this will raise a handful of reports from users unless it is
documented IN CAPS with !!!!!!!!dozens of exclamation marks!!!!!!!! (more
than once ;) ). So, to set expected_votes lower, one would need to somehow
edit higher_ever_seen.

Frankly speaking, my previous suggestion (keep a persistent list of cluster
members) is still valid here, and I still really like it. The admin would
just say: "Hey, node6 is no longer a part of the cluster, please delete it
from everywhere." The node is removed from cmap, and higher_ever_seen is then
recalculated automatically. Without this, the admin needs to "calculate" that
value by hand, and even this simple arithmetic can be error-prone in some
circumstances (time pressure, a sleepless night at work, etc.).

The most important point here is that any (still) possible split-brain is
then caused not by a software decision but by the admin's action. You
understand what that means for support (and for judges in the worst
case ;) ).

Best,
Vladislav

>
> expected_votes (runtime): 3
> quorum: 2
>
> higher_ever_seen: 8
> quorum: 5 (based on highest_ever_seen)
>
> in the simplest scenario where 3/4 nodes boot up in a separate
> partition, we are good, the partition would not be quorate.
>
> But in the worst-case scenario, where 5 nodes boot up in a separate
> partition, those can actually become quorate.
>
> That's clearly bad.
>
> I can't really think of a way to avoid 2 partitions going quorate at
> this point. I know it's a rather extreme corner case, but it is a case
> that can happen (and be sure customers will make it happen ;))
>
> Any suggestions?
>
> Thanks
> Fabio
>
> PS this patch + the recently posted leave_remove: option would give you
> 100% freedom to scale back and forth without even touching ev at
> runtime. Just need to solve this case....
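P.S. Just to put the 8/2+1 idea above into something concrete, here is
another throwaway Python sketch (only the numbers from this thread; I am not
claiming it closes the hole you describe, it is just the arithmetic of the
raw idea):

def quorum(expected_votes):
    return expected_votes // 2 + 1

highest_ever_seen = 8
ev_floor = quorum(highest_ever_seen)   # 8/2+1 = 5: lowest runtime expected_votes allowed

runtime_ev = max(3, ev_floor)          # cluster scaled back to 3 nodes, but floored at 5
print(runtime_ev, quorum(runtime_ev))  # 5 3  (quorum for the 3 remaining nodes is 3, not 2)

# The corner case from the quoted text: 5 nodes booting in a separate
# partition would still use highest_ever_seen (8) and its quorum (5), so
# whether this floor alone prevents two quorate partitions is exactly the
# open question.
separate_partition_votes = 5
print(separate_partition_votes >= quorum(highest_ever_seen))  # True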