Re: Using qdisk and a watchdog timer to eliminate power fencing single point of failure?

Lon Hohberger wrote:
I've got a deployment scenario for a two node cluster (where services are configured as active and standby) where the customer is concerned that the external power fencing device I am using (WTI) becomes a single point of failure. If the WTI for the active node dies, taking down the active node, the standby cannot bring up services because it cannot successfully fence the failed node. This leaves the cluster down.

Correct.  Although, if you plug in a serial terminal server, I have a
patch to talk to the WTI switch through the terminal server in case the
switch gets unjacked.

Actually, I'm more worried about a WTI that blows up, taking the active node with it. A terminal server won't help with that.

In the setup, storage fencing is not feasible as a backup for power fencing.

Not even using fence_scsi (SCSI-3 reservations)?  That's unfortunate :(

Well, it's possible, but this solution may be deployed with many different SAN implementations, so I was hoping to find a way to avoid having to certify that each SAN does SCSI reservations correctly.

I think I've worked out a scheme that uses qdiskd and the internal hardware watchdog timers in our nodes as a backup for power fencing, which I hope will eliminate the single point of failure.

Hardware watchdog timers = good stuff.


Here's how I see it working:

2. Create a heuristic (besides the usual network reachability test) for qdisk that resets the node's hardware watchdog timer. (I'll have to do some additional work to ensure that the watchdog gets turned off if I am gracefully shutting down the node's qdisk daemon.)
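
A rough sketch of such a heuristic program (the script path is hypothetical, and it assumes the watchdog driver supports the standard Linux 'magic close' feature and the kernel is not built with WATCHDOG_NOWAYOUT):

    #!/bin/sh
    # pet_watchdog.sh - hypothetical qdiskd heuristic program (sketch).
    # Each run writes one byte to the watchdog device, which resets the
    # hardware timer; closing without the magic 'V' leaves the timer
    # armed, so if qdiskd stops running the heuristic, the timer
    # eventually expires and the node resets.
    WDT=/dev/watchdog

    case "$1" in
        stop)
            # Called from the qdiskd shutdown script, not by qdiskd:
            # writing 'V' before close tells the driver to disarm.
            printf 'V' > "$WDT"
            ;;
        *)
            # Normal heuristic run: any write resets the timer.
            printf '.' > "$WDT" || exit 1
            ;;
    esac
    exit 0

It would be listed in cluster.conf as something like <heuristic program="/usr/local/sbin/pet_watchdog.sh" score="1" interval="2"/>.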

There's a watchdog daemon (userspace code) that lets you configure
heuristics for it.  Most are internal to it, and are therefore superior
to how qdiskd does heuristics from an HA / memory-neutrality perspective.
If some heuristic(s) are not met, the daemon can, at your option, stop
touching the watchdog device.
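
For example, a watchdog.conf along these lines (the values are only placeholders):

    # /etc/watchdog.conf (sketch)
    watchdog-device = /dev/watchdog
    # how often watchdogd pets the device
    interval = 10

    # built-in checks; see watchdog.conf(5) for the full list
    ping = 192.168.1.1
    interface = eth0
    max-load-1 = 24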

There's an open bugzilla to provide an integration path between qdiskd
and watchdogd - so that you can configure heuristics for watchdogd and
have qdiskd base its state on those.

For example, if watchdogd says "ok, we're not updating the watchdog
driver because of X", qdiskd can trigger a self-demotion off of that, or
maybe even write an 'If you don't hear from me in X seconds, consider me
dead' message to disk...?

That looks like good stuff; I'll look into it.  From looking at watchdogd, it can monitor whether a file gets updated, so it's easy to integrate quorumd and watchdogd in a simple fashion by just having a quorumd heuristic that touches a file.
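
Roughly, the wiring I have in mind (the file path and timings are placeholders):

    cluster.conf:

        <quorumd interval="1" tko="10" votes="1" label="qdisk">
            <!-- the side effect of this heuristic proves qdiskd is alive -->
            <heuristic program="touch /var/run/qdiskd-alive"
                       score="1" interval="2"/>
        </quorumd>

    watchdog.conf:

        watchdog-device = /dev/watchdog
        # reboot via the hardware timer if the file stops changing
        file = /var/run/qdiskd-alive
        change = 30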

3. Create a custom fencing script, run if power fencing fails, that examines qdisk's state to see whether the node that needs to be fenced has stopped updating the quorum disk.
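
Since fenced tries <method> blocks in the order they're listed, the backup script could hang off a second method, something like this (the backup agent's name is made up):

    <clusternode name="node1" nodeid="1" votes="1">
        <fence>
            <method name="power">
                <device name="wti-pdu" port="1"/>
            </method>
            <!-- tried only if the WTI method fails -->
            <method name="backup">
                <device name="qdisk-check"/>
            </method>
        </fence>
    </clusternode>

    <fencedevices>
        <fencedevice agent="fence_wti" name="wti-pdu"
                     ipaddr="192.168.1.50" passwd="password"/>
        <!-- hypothetical custom agent from step 3 -->
        <fencedevice agent="fence_qdisk_check" name="qdisk-check"/>
    </fencedevices>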

I think the easiest thing to do is make a quick, small-footprint API or
utility to talk to qdiskd to get states...

That's what I figured.

(I'm not sure how to do this--I hope that the information stored in qdisk's status_file will be sufficient to determine this; if not, I might have to modify qdisk to supply what I need.)

... because status_file is *sketchy* at best (really, it's a debugging
tool). ;)

I was afraid of that...

The standby node can then be sure that the active node has either rebooted itself (by qdiskd's action or via the watchdog timer) or lost power entirely.
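
The core of that backup agent could be as simple as the sketch below: wait out the worst-case qdiskd-plus-watchdog window, then report success.  The timings are placeholders, and ideally it would confirm against qdiskd's state rather than rely on time alone:

    #!/bin/sh
    # fence_qdisk_check - hypothetical backup fence agent (sketch).
    # Fence agents receive their options as name=value lines on stdin;
    # here we only care about the victim's node name.
    while read -r line; do
        case "$line" in
            nodename=*) victim=${line#nodename=} ;;
        esac
    done

    # Worst case before the victim is certainly down: qdiskd's eviction
    # window plus the hardware watchdog timeout, plus some margin.
    # Placeholder value - derive it from the real quorumd/watchdog settings.
    sleep 90

    # TODO: query qdiskd (via the small API discussed above) to confirm
    # that $victim really stopped updating the quorum disk before
    # declaring success.
    exit 0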


Can anyone see a weakness in this approach I haven't thought of?

It's good from a best-effort standpoint.  We don't have anything that
does 'best effort' fencing - it's mostly all black/white.

A question that comes up is: if we use the watchdog + watchdog daemon,
do we need qdisk at all?  I mean, if there's an 'eventual timeout'
anyway, based on the expectation that the watchdog timer will fire, and
we rely on it - why bother with the intermediate steps?

Hardware watchdog timers are going to be more reliable than just about
anything qdiskd could provide.

Ok, I get it.  It's probably a couple of orders of magnitude more reliable, but since it relies only on timing, there's no real *positive* indication that the fencing succeeded, so it's really only best-effort.  Even though it would take three failures (network disruption of the heartbeat, quorumd failing to reboot the node, and the watchdog timer failing as well), there's still a slim, slim chance that the node is still trying to write to the SAN.  If I want to guarantee that there's never a split brain, then this isn't good enough.

Thanks for the advice.

--
Jon Biggar
Floorboard Software
jon@xxxxxxxxxxxxxx
jon@xxxxxxxxxx

--
Linux-cluster mailing list
Linux-cluster@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/linux-cluster
