Reviewed-by: Steven Dake <sdake@xxxxxxxxxx>

On 01/26/2012 06:27 AM, Fabio M. Di Nitto wrote:
> From: "Fabio M. Di Nitto" <fdinitto@xxxxxxxxxx>
>
> it is not correct to randomly accept expected_votes from any node in
> the cluster. We can only allow expected_votes from quorate nodes.
>
> A quorate cluster is "always" right and have the correct expected_votes.
>
> One of the different bug triggers:
>
> quorum {
>   expected_votes: 8
>   auto_tie_breaker: 1
>   last_man_standing: 1
> }
>
> start all 8 nodes.
> clean shut down 2 nodes.
> wait for lms to kick in.
> kill 3 nodes with highest nodeid
> (we want to retain a quorate partition of 3 nodes)
> start one node again -> cluster will be unquorate
>
> This happens because the node rebooting/rejoining with
> non current cluster status will propagate an expected_votes of 8,
> while in reality the cluster is down to expected_votes: 3.
>
> 4 nodes are still < 5 (quorum for 8 nodes/votes).
>
> In order to avoid this condition, we need to exchange expected_votes
> information among nodes but we cannot randomly trust everybody.
>
> 1) Allow expected_votes to be changed cluster-wide only if the
>    information is coming from a quorate node.
> 2) Fix node->expected_votes based on quorate status
> 3) allow a joining node to decrease quorum and expected_votes
>    if the node is not yet quorate, but it's joining a quorate
>    cluster
>
> Signed-off-by: Fabio M. Di Nitto <fdinitto@xxxxxxxxxx>
> ---
>  exec/votequorum.c |   16 ++++++++++++++--
>  1 files changed, 14 insertions(+), 2 deletions(-)
>
> diff --git a/exec/votequorum.c b/exec/votequorum.c
> index 47132d6..798746a 100644
> --- a/exec/votequorum.c
> +++ b/exec/votequorum.c
> @@ -1016,6 +1016,7 @@ static void message_handler_req_exec_votequorum_nodeinfo (
>  	int old_expected;
>  	nodestate_t old_state;
>  	int new_node = 0;
> +	int allow_downgrade = 0;
>
>  	ENTER();
>
> @@ -1038,9 +1039,20 @@ static void message_handler_req_exec_votequorum_nodeinfo (
>
>  	/* Update node state */
>  	node->votes = req_exec_quorum_nodeinfo->votes;
> -	node->expected_votes = req_exec_quorum_nodeinfo->expected_votes;
>  	node->state = NODESTATE_MEMBER;
>
> +	if ((!cluster_is_quorate) &&
> +	    (req_exec_quorum_nodeinfo->quorate)) {
> +		allow_downgrade = 1;
> +		us->expected_votes = req_exec_quorum_nodeinfo->expected_votes;
> +	}
> +
> +	if (req_exec_quorum_nodeinfo->quorate) {
> +		node->expected_votes = req_exec_quorum_nodeinfo->expected_votes;
> +	} else {
> +		node->expected_votes = us->expected_votes;
> +	}
> +
>  	log_printf(LOGSYS_LEVEL_DEBUG, "nodeinfo message: votes: %d, expected: %d wfa: %d quorate: %d",
>  		   req_exec_quorum_nodeinfo->votes,
>  		   req_exec_quorum_nodeinfo->expected_votes,
> @@ -1064,7 +1076,7 @@ static void message_handler_req_exec_votequorum_nodeinfo (
>  	    old_votes != node->votes ||
>  	    old_expected != node->expected_votes ||
>  	    old_state != node->state) {
> -		recalculate_quorum(0, 0);
> +		recalculate_quorum(allow_downgrade, 0);
>  	}
>
>  	if (!nodeid) {
_______________________________________________
discuss mailing list
discuss@xxxxxxxxxxxx
http://lists.corosync.org/mailman/listinfo/discuss
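For anyone following the arithmetic in the commit message, here is a minimal standalone sketch of the trust rule and the quorum numbers from the example scenario. It is not the votequorum code: struct peer_info, quorum_for(), trusted_expected_votes() and is_quorate() are hypothetical names made up for illustration, and the simple-majority formula is an assumption that merely agrees with the "5 (quorum for 8 nodes/votes)" figure quoted above.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, simplified stand-in for what a nodeinfo message carries.
 * This is NOT corosync's real structure. */
struct peer_info {
	unsigned int expected_votes;
	bool quorate;
};

/* Assumed simple-majority rule: more than half of expected_votes. */
static unsigned int quorum_for(unsigned int expected_votes)
{
	assert(expected_votes > 0);
	return expected_votes / 2 + 1;
}

/* Rule described in the patch: record the sender's expected_votes only
 * if the sender is quorate, otherwise fall back to our own view. */
static unsigned int trusted_expected_votes(const struct peer_info *peer,
					   unsigned int our_expected_votes)
{
	return peer->quorate ? peer->expected_votes : our_expected_votes;
}

/* True if total_votes reaches the quorum derived from expected_votes. */
static bool is_quorate(unsigned int total_votes, unsigned int expected_votes)
{
	return total_votes >= quorum_for(expected_votes);
}
```

With these helpers, quorum_for(8) is 5, so is_quorate(4, 8) is false: that is the failure in the scenario, where the rejoining node's stale value of 8 wins. Under the patched rule, a non-quorate rejoining peer advertising 8 is ignored by a partition whose own view is 3, so trusted_expected_votes() returns 3 and the 4 running nodes stay quorate.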