Hi Vladislav,

On 1/27/2012 10:46 PM, Vladislav Bogdanov wrote:
> 26.01.2012 15:41, Fabio M. Di Nitto wrote:
>> On 1/26/2012 1:15 PM, Vladislav Bogdanov wrote:
>>
>>>>>> Probably even not lower than the number of votes from nodes which are
>>>>>> now either active or inactive but joined at least once (I suppose that
>>>>>> the nodelist is fully editable at runtime, so an admin may somehow
>>>>>> reset the join count of a node and only then reduce expected_votes).
>>>>
>>>> I have been thinking about this some more, but I am not sure I grasp the
>>>> use case or what kind of protection you are trying to suggest.
>>>>
>>>> Reducing the number of expected_votes is an admin action; it's not very
>>>> different from removing a node from the "seen" list manually and
>>>> recalculating expected_votes.
>>>>
>>>> Can you clarify it for me?
>>>
>>> Imagine (this case is a little bit hypothetical, but anyway):
>>> * You have a cluster with 8 active nodes, and (for some historical
>>>   reason, or due to admin fault/laziness) expected_votes is set to 3
>>>   (ok, you had a 3-node cluster not so long ago, but added more nodes
>>>   because of growing load).
>>> * The cluster splits 5+3 due to loss of communication between switches
>>>   (or switch stacks).
>>> * 3 nodes are fenced.
>>> * The partition with the majority continues operation.
>>> * The 3 fenced nodes boot back and form a *quorate* partition, because
>>>   they have expected_votes set to 3.
>>> * Data is corrupted.
>>>
>>> If the fenced nodes knew, right after boot, that the cluster consists of
>>> 8 active nodes, they would not override the expected_votes obtained from
>>> the persistent "seen" list with the lower value from the config, and the
>>> data would be safe.
>>
>> Oh great.. yes, I see where you are going here. It sounds like an
>> interesting approach, but it clearly requires a file in which to store
>> that information.
>
> I do not see a big problem here...
> Corosync saves its ring persistently anyway.
>
>>
>> There is still a window where the file containing the expected_votes
>> from the "seen" list is corrupted, though. At that point it's difficult
>> to detect which of the two pieces of information is correct, and it
>> doesn't prevent the issue at all if the file is removed entirely (even
>> by accident), but at a first shot I would say that it is better than
>> nothing.
>
> Hopefully at least not all nodes from a fenced partition will have it
> corrupted/deleted. They should honor the maximal ev value among them all.
>
>>
>> I'll have a test and see how it pans out, but at first glance I think we
>> should only store the last known expected_votes while quorate.
>> The node booting would use the higher of the two values. If the cluster
>> has decreased in size in the meantime, the joining node would be
>> informed about it (just sent a patch to the list about it 10 minutes
>> ago ;))

So I am 99% done with this patch (saving the highest expected_votes and so
on), but there is a corner case I am not entirely sure how to handle.

Let's take an example:

8-node cluster (each node votes 1 for simplicity), expected_votes set to 3.

3 nodes are happily running and all... we then grow to 8 nodes.

The new expected_votes is 8 (and we remember this by writing it to a file).

Now we scale back to 3 nodes. At this point:

expected_votes (runtime): 3
higher_ever_seen:         8
quorum:                   5

In the simplest scenario, where 3 or 4 nodes boot up in a separate
partition, we are good: that partition would not be quorate.

But in the worst-case scenario, where 5 nodes boot up in a separate
partition, those can actually become quorate. That's clearly bad.
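
To spell out the arithmetic (a toy sketch, not votequorum code; the only
assumptions are one vote per node and the usual majority formula
quorum = expected_votes / 2 + 1, which is what gives quorum 5 above):

    # Toy Python sketch of the corner case, assuming the booting nodes
    # honour the saved value (8) instead of the lower config value (3).
    def quorum(expected_votes):
        # plain majority: integer division, one vote per node
        return expected_votes // 2 + 1

    saved_ev = 8                # higher_ever_seen remembered on disk
    print(quorum(saved_ev))     # 5

    for booting_nodes in (3, 4, 5):
        quorate = booting_nodes >= quorum(saved_ev)
        print(booting_nodes, quorate)   # 3 False, 4 False, 5 True <- bad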
I can't really think of a way to avoid 2 partitions going quorate at this
point. I know it's a rather extreme corner case, but it is a case that can
happen (and be sure customers will make it happen ;))

Any suggestions?

Thanks
Fabio

PS: this patch + the recently posted leave_remove: option would give you
100% freedom to scale back and forth without even touching ev at runtime.
Just need to solve this case....
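
PPS: to make the "two quorate partitions" failure explicit, here is the same
toy arithmetic, under my assumption that the running partition still judges
quorum from the runtime expected_votes of 3 while the booting partition
picks up the saved value of 8 (a sketch of my reading of the situation, not
of what the patch actually does):

    def quorum(expected_votes):
        return expected_votes // 2 + 1

    # running partition: 3 nodes, runtime expected_votes = 3
    print(3 >= quorum(3))    # True  -> quorate
    # booting partition: 5 nodes, saved higher_ever_seen = 8
    print(5 >= quorum(8))    # True  -> also quorate, two quorate partitions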