14.01.2012 11:57, Fabio M. Di Nitto wrote:
> On 01/14/2012 09:09 AM, Vladislav Bogdanov wrote:
>> Hi,
>>
>> 13.01.2012 21:21, Fabio M. Di Nitto wrote:
>> [snip]
>>>> + expected_votes is removed and instead auto-calculated based upon
>>>> quorum_votes in the node list
>>
>> Is it possible to dynamically keep track of "seen" nodes here and use
>> only those nodes for the expected_votes calculation?
>>
>> I even have a use case for that:
>> I "run" a cluster consisting of at most 17 nodes with UDPU, so all
>> nodes are listed in the config. Currently only 3 nodes are powered
>> on, because I do not yet have load that requires more (and power is
>> expensive in European datacenters). When load increases I just power
>> on additional nodes and the quorum expectations are recalculated
>> automagically. I have that implemented with corosync + pacemaker
>> right now: pacemaker keeps that list of nodes and does the quorum
>> calculations correctly. And I'm absolutely happy with that. From what
>> I can see, the changes being discussed will break my happiness.
>
> Yes and no. Let me explain:
>
> votequorum already does that internally. For example:
>
> expected_votes: 3 in corosync.conf
>
> You power on your 4th node (assuming everybody votes 1, to keep this
> example simple) and expected_votes is automatically bumped to 4 on
> all nodes.
>
> While this is what you are asking for, there are a few corner cases
> where this could lead to a dangerous situation.
>
> First of all, the new expected_votes is not written to disk but only
> retained internally in votequorum.

Actually, I'd prefer it to be written to disk together with that "seen"
list, so the cluster knows who should be there even after a full
restart. But I like neither the idea of having it in the config nor of
calculating it from a nodelist. From my point of view it is not a
configuration variable but rather a "state" one, and it should be
managed in a stateful way (saved to disk).
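
To make this concrete, the piece I am talking about is just the
votequorum stanza (syntax from memory, as I understand the votequorum
configuration; values as in Fabio's example above):

    quorum {
        provider: corosync_votequorum
        expected_votes: 3
    }

It is that expected_votes line that I would rather see live as state
under /var/lib than in the config file.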

> This approach does not protect you against partitions properly,
> especially at startup of the cluster. For example, out of 16 nodes,
> 8 are on switch A and 8 on switch B. The interlink between the
> switches is broken. All nodes know expected_votes: 3 from
> corosync.conf.

If expected_votes lived not in the config but in some file under
/var/lib (corosync already does the same for ring state), and were
managed dynamically cluster-wide, that should be impossible (provided,
of course, that the admin didn't delete that file on all nodes). The
cluster knows it has 16 active nodes (it even knows every member it
has ever "seen"), so each 8-node partition falls short of the 9 votes
that a majority of 16 requires, and neither side becomes quorate.

> Both partitions of the cluster can achieve quorate status, and they
> can create chaos: fencing each other, data corruption and all. Now,
> we agree that this is generally an admin error (not noticing that the
> interlink is down), but it leaves a window open for disasters.
>
> On the other side, I am not going to force users to do it
> differently. The current votequorum implementation allows this use
> case, and I am not going to enforce otherwise. Users should still be
> aware of what they are asking for, though.
>
>> It would also be great if I were able to forcibly remove an inactive
>> node from that "seen" list with just one command on *one* cluster
>> node. The use case for that is human error, when the wrong node is
>> powered on by mistake.
>
> The "seen" list within the quorum module is dynamic. As soon as you
> shut down a node (cleanly or otherwise) and totem notices that the
> node has gone away, that node is removed from the quorum calculation.

Ugh? Do you mean that the dynamic version of expected_votes is
decremented automatically?

> Your concern is probably related to the discussed nodelist, but
> that's up to others to decide "how" to handle addition/removal of
> nodes. It doesn't affect the quorum module at all.

Generally yes, it is about the nodelist, the vote list and
expected_votes in a config file.

All I wanted to say is that I'm pretty happy with how pacemaker
implements quorum management (from an admin's point of view). If I
power on more "unseen" nodes, expected_votes is automatically
incremented and saved into the CIB. If I then power those nodes down,
their votes are still counted until I remove them from the CIB and
decrement expected_votes manually (actually, that part didn't fully
work the last time I checked). And I do not like the idea of touching
the configuration file every time I want to add a node to the cluster,
then redistributing that config to all nodes, and then reloading it on
every node.

Right now I have all 17 nodes listed in corosync.conf (UDPU), but my
expected_votes in the pacemaker CIB is 3. That's why Steve's idea of
calculating expected_votes from a vote list would be a regression for
me.

Vladislav
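
P.S. To put some flesh on that last paragraph, this is roughly what my
setup looks like today (corosync 1.x UDPU syntax from memory, trimmed
to three of the seventeen member entries; addresses are illustrative):

    totem {
        version: 2
        transport: udpu
        interface {
            ringnumber: 0
            bindnetaddr: 192.168.1.0
            member {
                memberaddr: 192.168.1.1
            }
            member {
                memberaddr: 192.168.1.2
            }
            member {
                memberaddr: 192.168.1.3
            }
            # ...and so on up to 192.168.1.17
        }
    }

while the quorum expectation lives in the pacemaker CIB as the
expected-quorum-votes cluster property, adjustable by hand with
something along the lines of:

    crm_attribute --type crm_config --name expected-quorum-votes --update 3

If expected_votes were instead derived from the member/vote list, it
would become 17, and my 3 running nodes (needing 9 votes for a
majority) could never be quorate.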