Hello list,
We know that CTDB uses a lock file in the cluster file system to prevent split-brain.
It is a really good design as long as all nodes in the cluster can mount the cluster file system (e.g. GPFS/GFS/GlusterFS), and CTDB works happily under this assumption.
However, when split-brain happens, the disconnected private network usually violates this assumption.
For example, we have four nodes (A, B, C, D) in the cluster and GlusterFS is the backend.
GlusterFS and CTDB on all nodes communicate with each other via the private network, and CTDB manages the public network.
If node A is disconnected from the private network, the cluster splits into group (A) and group (B,C,D).
A recovery master election is triggered once CTDB determines that nodes are disconnected, i.e. CTDB elects a new recovery master for each group after 26 seconds (KeepaliveInterval*KeepaliveLimit+1 with the default tunables).
Then node A will be the recovery master of group (A) and some node (e.g. B) will be the recovery master of group (B,C,D).
Now, A and B will both try to take the lock file, but GlusterFS itself also communicates via the same private network.

A big problem arises here: whether the lock file can be taken depends on the lock implementation and on the disconnection detection of GlusterFS (or whichever cluster file system is used). To my knowledge, GlusterFS determines that a node is disconnected after 42 seconds and only then releases its locks. With this configuration, nodes A and B will ban themselves, and each newly elected recovery master will ban itself in turn. This is really bad, and it means we cannot treat the cluster file system as a black box when using the lock file design.
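To make the timing concrete, here is my reading of the sequence in the four-node scenario above (assuming default tunables on both sides; the 42-second figure is what I know of GlusterFS and may differ in other setups):

t=0s      the private link to node A fails
t=26s     CTDB in each group declares the other side dead and elects a new recovery master (A, and e.g. B)
t=26-42s  A and B both try to take the lock file, but GlusterFS has not yet detected the disconnection, so the old lock is still held; the attempts fail, A and B ban themselves, and each newly elected recovery master hits the same window
t=42s     GlusterFS finally declares the node disconnected and releases the lock, but by then the recovery masters have already been banned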
Hence, I have been thinking about whether CTDB could provide split-brain prevention without a lock file.
Using a quorum concept to ban a node might be an option, and I have made a small modification to the CTDB source code.
The modification checks in main_loop() of server/ctdb_recoverd.c whether more than (nodemap->num)/2 nodes are connected.
If not, the node bans itself and logs the error "Node %u in the group without quorum".
In server/ctdb_recoverd.c:
static void main_loop(struct ctdb_context *ctdb, struct ctdb_recoverd *rec, TALLOC_CTX *mem_ctx)
...
        /* count how many active nodes there are */
        rec->num_active = 0;
        rec->num_connected = 0;
        for (i=0; i<nodemap->num; i++) {
                if (!(nodemap->nodes[i].flags & NODE_FLAGS_INACTIVE)) {
                        rec->num_active++;
                }
                if (!(nodemap->nodes[i].flags & NODE_FLAGS_DISCONNECTED)) {
                        rec->num_connected++;
                }
        }
+       if (rec->num_connected < (nodemap->num)/2 + 1) {
+               DEBUG(DEBUG_ERR, ("Node %u in the group without quorum\n", pnn));
+               ctdb_ban_node(rec, pnn, ctdb->tunable.recovery_ban_period);
+       }
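
To illustrate what the new check does, here is a standalone sketch for discussion only (not CTDB code; has_quorum() is just a name I made up). It shows that a node stays unbanned only when it sees a strict majority of the cluster:

#include <stdio.h>
#include <stdbool.h>

/* Same rule as the patch: a node keeps quorum only if it can see
 * a strict majority, i.e. num_connected >= num/2 + 1 (integer division). */
static bool has_quorum(unsigned int num_connected, unsigned int num)
{
        return num_connected >= num / 2 + 1;
}

int main(void)
{
        /* The four-node example above: group (A) sees 1 node,
         * group (B,C,D) sees 3 nodes. */
        printf("group (A):     %s\n", has_quorum(1, 4) ? "quorum" : "ban");
        printf("group (B,C,D): %s\n", has_quorum(3, 4) ? "quorum" : "ban");

        /* An even 2-2 split leaves neither side with a majority,
         * so both sides ban themselves. */
        printf("2-2 split:     %s\n", has_quorum(2, 4) ? "quorum" : "ban");

        /* A two-node cluster can never see a majority after a split. */
        printf("1 of 2 nodes:  %s\n", has_quorum(1, 2) ? "quorum" : "ban");

        return 0;
}

Note the integer division: an even split bans both sides, and a two-node cluster can never keep quorum after a split, so this approach trades availability for consistency in those cases.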
This modification seems to provide split-brain prevention without a lock file in my tests (clusters with more than 3 nodes).
Does this modification cause any side effects, or is it a stupid design?
Please kindly answer me; I would appreciate any input from smart people like you guys.
Thanks,
Az