Re: fencing: external vs watchdog

Lon Hohberger <lhh@xxxxxxxxxx> · Fri, 17 Aug 2007 13:34:22 -0400

On Fri, Aug 17, 2007 at 01:03:03PM +0200, Maciej Bogucki wrote:
> 1. Watchdog is a piece of code which runs in user space, so you don't
> have a 100% guarantee that it will run correctly.

Some clarification:

Traditionally, a watchdog is a piece of hardware which a userland
daemon writes to periodically.  Failure to write to the piece of
hardware after a set time causes a system reset (the app holding the
watchdog open crashing is one obvious way to cause this to happen).

The Linux kernel also has a software watchdog (called softdog) which
operates in the kernel using the same API it exposes for hardware
watchdogs.

The watchdog daemon (Debian, RHEL5.1, etc.) is one well-known
implementation of the userland part, and it is often confused with the
watchdog timer itself.  It monitors administrator-defined resources,
touches the watchdog timer device periodically while things are "ok",
and stops touching it if things go bad (stopping causes the WD to fire).
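
As a rough sketch of that userland loop (this is not the watchdog
daemon's actual code; feed_watchdog, checks, max_beats and the sleep
parameter are names made up for illustration), assuming the device
accepts plain writes as keepalives the way /dev/watchdog does:

```python
import time


def feed_watchdog(device_path, checks, interval_s=1.0,
                  max_beats=None, sleep=time.sleep):
    """Write keepalives while every check passes; return the write count.

    device_path would be /dev/watchdog on a real system. Point it at an
    ordinary file when experimenting: opening the real device arms the
    timer, and the machine resets if the writes stop.
    """
    beats = 0
    # Unbuffered, so every write reaches the device immediately.
    with open(device_path, "wb", buffering=0) as wd:
        while max_beats is None or beats < max_beats:
            if not all(check() for check in checks):
                # Things went bad: stop writing and do NOT disarm.
                # The timer keeps counting down and resets the node.
                return beats
            wd.write(b"\0")
            beats += 1
            sleep(interval_s)
        # Clean stop (only reachable with max_beats set). Drivers that
        # support "magic close" disarm on 'V'; others ignore it.
        wd.write(b"V")
    return beats
```

Something like feed_watchdog("/dev/watchdog", [disk_ok, net_ok]) would
be the real-world call, where disk_ok and net_ok are hypothetical
admin-defined checks returning True or False.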

The point here is that it doesn't matter if the userspace code fails,
blows up, or otherwise - the *failure* mode for a watchdog timer is to
reset the system.

> 2. Watchdog fencing can't protect you against split-brain situations,
> where the consequence could be corruption of your data. This is where
> external fencing comes in.

You can (at least, mostly) solve this if you have alternative mechanisms
for cluster communications (ex: a quorum disk on a SAN and/or using
external tie-breakers/ping-nodes/whatever).

However - more in line with your point - it's not simple, and it relies
on a lot of assumptions.
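
To make the tie-breaker idea concrete, here is a toy majority-vote
check.  It is only a sketch under the assumption that the quorum disk
or ping-node is counted as extra votes for whichever partition can
still reach it; has_quorum and its parameters are hypothetical names,
not any cluster manager's API:

```python
def has_quorum(node_votes_seen, total_node_votes,
               tiebreaker_votes=0, tiebreaker_reachable=False):
    """Return True if this partition holds a strict majority of votes."""
    expected = total_node_votes + tiebreaker_votes
    seen = node_votes_seen
    if tiebreaker_reachable:
        # The partition that can still reach the tie-breaker gets its votes.
        seen += tiebreaker_votes
    return 2 * seen > expected


# Two-node cluster split down the middle, no tie-breaker:
# each side sees 1 of 2 votes, so neither side is quorate.
print(has_quorum(1, 2))

# Same split with a 1-vote tie-breaker: only the side that can
# still reach it stays quorate.
print(has_quorum(1, 2, tiebreaker_votes=1, tiebreaker_reachable=True))
print(has_quorum(1, 2, tiebreaker_votes=1, tiebreaker_reachable=False))
```

The assumption hiding in there is that at most one partition can reach
the tie-breaker at any given moment, which is exactly the kind of thing
that is hard to guarantee.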

> There is another point of view about Linux clusters and other commercial
> clusters (e.g. Sun Cluster). Linux Cluster resides in user space, so you
> don't have a guarantee that local fencing will run ok, and you need
> external fencing to resolve this main problem. Sun Cluster resides in
> kernel space, so when one node loses quorum it does a "kernel panic" and
> you have a 100% guarantee that it will succeed.

FWIW, the hardware watchdog timer is outside of the operating system
entirely.  The entire kernel could hang/crash and the watchdog would
still fire.
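
A toy model of that countdown, just to show the timing (this is a
simulation with made-up names, WatchdogTimer and reset_time, and not a
driver): the timer only knows when it was last written to, so it makes
no difference whether the writes stopped because the daemon died or
because the whole kernel hung.

```python
class WatchdogTimer:
    """Toy model of a hardware countdown timer (not a real driver)."""

    def __init__(self, timeout_s, now=0.0):
        self.timeout_s = timeout_s
        self.deadline = now + timeout_s

    def pet(self, now):
        # Only a write that arrives before the deadline pushes it out.
        if now < self.deadline:
            self.deadline = now + self.timeout_s

    def fired(self, now):
        return now >= self.deadline


def reset_time(timeout_s, pet_times):
    """Return the time at which the timer fires, given write timestamps."""
    wd = WatchdogTimer(timeout_s)
    for t in sorted(pet_times):
        if wd.fired(t):
            break  # the node was already reset; later writes never happen
        wd.pet(t)
    return wd.deadline


# 10 second timeout, writes at t=5, 10 and 15, then everything hangs.
# A write that would have come at t=40 is too late to matter.
print(reset_time(10, [5, 10, 15, 40]))
```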

Most of the reason for fencing (at all) is the notion of a live-hang of
an indefinite period of time - where a node just stops for a few seconds
due to a kernel bug or for some other reason.  If the whole kernel stops
for a few seconds, the node won't know it's no longer in the quorum, or
calling panic() could be delayed.

There are kernel hangcheck timers, but as I understand it, they're racy:
you cannot guarantee that the hang-check will complete before an
outstanding I/O is flushed to disk.  I could be wrong here.

> For me network fencing (IPMI, DRAC, ...) isn't good, because you have to
> connect via the network and it could fail, and so on. The best fencing
> mechanism is fence_scsi, which is an I/O fencing agent. It can be used
> with SCSI devices that support persistent reservations (SPC-2 or
> greater). In most cases you have shared storage that supports SPC-2 or
> SPC-3.

Yup.  You can also use FC zoning in addition to fence_scsi, if you
want.

The biggest reason not to use watchdog timers as 'fencing' is that
it's complex and difficult to do correctly/reliably, especially in
the two-node case.

-- Lon

-- 
Lon Hohberger - Software Engineer - Red Hat, Inc.

--
Linux-cluster mailing list
Linux-cluster@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/linux-cluster