I've got a deployment scenario for a two-node cluster (where services
are configured as active and standby) where the customer is concerned
that the external power fencing device I am using (WTI) becomes a single
point of failure. If the WTI for the active node dies, taking down the
active node, the standby cannot bring up services because it cannot
successfully fence the failed node. This leaves the cluster down.
In this setup, storage fencing is not feasible as a backup for power fencing.
I think I've worked out a scheme that uses qdiskd and the internal
hardware watchdog timers in our nodes as a backup for power fencing,
which I hope will eliminate the single point of failure.
Here's how I see it working:
1. Configure a quorum disk for the two nodes.
2. Create a heuristic (besides the usual network reachability test) for
qdisk that resets the node's hardware watchdog timer. (I'll have to do
some additional work to ensure that the watchdog gets turned off if I am
gracefully shutting down the node's qdisk daemon.)
3. Create a custom fencing script that is run if power fencing fails
that examines qdisk's state to see if the node that needs to be fenced
is no longer updating the quorum disk. (I'm not sure how to do this--I
hope that the information stored in qdisk's status_file will be
sufficient to determine this; if not, I might have to modify qdisk to
supply what I need.) This script will wait until it sees that the
failed node has not updated the quorum disk for longer than the watchdog
timer setting plus a safety margin, and then return success. If it sees
the other node is still updating the quorum disk, it will return failure.
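
To make the decision logic in step 3 concrete, here is a rough Python
sketch. It is only an illustration under assumptions: read_last_update is
a hypothetical callable that returns some value (a timestamp or sequence
number) that changes whenever the peer writes to the quorum disk. How to
obtain that value from qdisk's status_file is exactly the open question
above, so it is left out. The silence is measured from the moment the
script starts, which avoids comparing clocks between the two nodes.

```python
import time


def backup_fence(read_last_update, watchdog_timeout, margin,
                 poll_interval=1.0, clock=time.monotonic, sleep=time.sleep):
    """Return True once the peer has been silent long enough, else False.

    read_last_update is a hypothetical callable (not part of qdisk) whose
    return value changes each time the peer updates the quorum disk.
    """
    # The peer must stay silent for the watchdog period plus a safety margin.
    required_silence = watchdog_timeout + margin
    baseline = read_last_update()
    start = clock()
    while clock() - start < required_silence:
        sleep(poll_interval)
        if read_last_update() != baseline:
            # The peer is still updating the quorum disk: report failure.
            return False
    # No update seen for the whole window: treat the peer as fenced.
    return True
```

A wrapper script would turn the result into an exit status, 0 when
backup_fence returns True and non-zero otherwise.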
The standby node can then be sure that the active node has either
rebooted itself (through qdiskd's action or the watchdog timer) or
lost power.
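
The heuristic in step 2 might look something like the following Python
sketch. It assumes the watchdog is exposed as /dev/watchdog and that the
driver follows the usual Linux conventions: any write is a keepalive, and
writing 'V' just before close (the "magic close") stops the timer unless
the driver was loaded with nowayout. Both function names are made up for
illustration, and whether reopening the device on every heuristic run is
acceptable depends on the particular driver.

```python
import os

WATCHDOG_DEVICE = "/dev/watchdog"  # assumption: standard Linux device node


def _write_byte(device, byte):
    """Open the device, write a single byte, and close it again."""
    try:
        fd = os.open(device, os.O_WRONLY)
    except OSError:
        return False
    try:
        os.write(fd, byte)
        return True
    except OSError:
        return False
    finally:
        os.close(fd)


def pet_watchdog(device=WATCHDOG_DEVICE):
    """Keepalive: any byte other than 'V' resets the countdown."""
    return _write_byte(device, b"\0")


def disarm_watchdog(device=WATCHDOG_DEVICE):
    """Magic close: 'V' asks the driver to stop the timer on close.

    Has no effect if the driver lacks magic close or nowayout is set.
    """
    return _write_byte(device, b"V")
```

The heuristic program would exit with status 0 when pet_watchdog()
returns True, since qdiskd counts a zero exit status as a pass. The
graceful-shutdown path for qdiskd would call disarm_watchdog() instead.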
Can anyone see a weakness in this approach I haven't thought of?
--
Jon Biggar
Floorboard Software
jon@xxxxxxxxxxxxxx
jon@xxxxxxxxxx
--
Linux-cluster mailing list
Linux-cluster@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/linux-cluster