Re: Fence device, How it work

Michael Will <mwill@xxxxxxxxxxxxxxxxxxxx> · Tue, 08 Nov 2005 16:46:40 -0800

I was thinking more along these lines:

1. node A fails
2. node B reboots node A
3. node A fails again because it has not been fixed.

Now we could have a 2-3-2 loop. The worst case is
that 3. is actually
3.1 node A comes up and starts reacquiring its resource
3.2 node A fails again because it has not been fixed
3.3 goto 2
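
The loop above can be sketched as a small simulation. This is only a toy model, not how any cluster manager is implemented: the function name `simulate_fencing`, the action values "reboot" and "off", and the `max_cycles` cap are all made up for illustration. It assumes node A has a persistent fault, so it fails again every time it comes back up.

```python
def simulate_fencing(action, max_cycles=5):
    """Toy model of node B fencing a node A that has an unfixed fault.

    action: "reboot" (power-cycle A) or "off" (leave A powered down).
    max_cycles: hypothetical cap so the simulation terminates.
    Returns the list of events in order.
    """
    if action not in ("reboot", "off"):
        raise ValueError("action must be 'reboot' or 'off'")

    events = ["A fails"]                         # step 1
    for _ in range(max_cycles):
        if action == "off":
            # A stays down until someone fixes it, so the loop ends here.
            events.append("B powers A off")
            break
        events.append("B reboots A")             # step 2
        events.append("A reacquires resource")   # step 3.1
        events.append("A fails")                 # step 3.2, then back to 2
    return events


if __name__ == "__main__":
    for action in ("reboot", "off"):
        events = simulate_fencing(action, max_cycles=3)
        print(action, "->", len(events), "events:", events)
```

With "reboot" the same three steps repeat until the cap is reached; with "off" the sequence stops after a single fencing action.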

Your recommendation f/g is exactly what I was wondering about
as an alternative. I know it is possible, but I am trying to understand
why it would not be the default behavior.

In active/passive heartbeat style setups I set the nice-failback
option so it does not try to reclaim resources unless the other
node fails, but I wonder what the best path is in a multinode
active/active setup.

Michael

Lon Hohberger wrote:
On Tue, 2005-11-08 at 07:52 -0800, Michael Will wrote:

Power-cycle. 

I always wondered about this. If the node has a problem, chances are
that rebooting does not fix it. Now if the node comes up semi-functional
and attempts to regain control over the resource that it owned before,
then that could be bad. Should it not rather be shut down so that human
intervention can fix it before it is made operational again?

This is a bit long, but maybe it will clear some things up a little.  As
far as a node taking over a resource it thinks it still has after a
reboot (without notifying the other nodes of its intentions), that would
be a bug in the cluster software, and a really *bad* one too!

A couple of things to remember when thinking about failures and fencing:

(a) Failures are rare.  A decent PC has something like 99.95% uptime
(I wish I knew where I heard/read this long ago) - with no
redundancy at all.  A server with ECC RAM, RAID for internal disks, etc.
probably has a higher uptime.

(b) The hardware component most likely to fail is a hard disk (moving
parts).  If that's the root hard disk, the machine probably won't boot
again.  If it's the shared RAID set, then the whole cluster will likely
have problems.

(c) I hate to say this, but the kernel is probably more likely to fail
(panic, hang) than any single piece of hardware.

(d) Consider this (I think this is an example of what you said?):
    1. Node A fails
    2. Node B reboots node A
    3. Node A correctly boots and rejoins cluster
    4. Node A mounts a GFS file system correctly
    5. Node A corrupts the GFS file system

What is the chance that 5 will happen without data corruption occurring
during or before 1?  Very slim, but nonzero - which brings me to my next
point...

(e) Always make backups of critical data, no matter what sort of block
device or cluster technology you are using.  A bad RAM chip (e.g. a
parity RAM chip missing a double-bit error) can cause periodic, quiet
data corruption.  Chances of this happening are also very slim, but
again, nonzero.  Probably at least as likely to happen as (d).

(f) If you're worried about (d) and are willing to take the expected
uptime hit for a given node when that node fails, even given (c), you
can always change the cluster configuration to turn "off" a node instead
of rebooting it. :)

(g) You can chkconfig --del the cluster components so that they don't
automatically start on reboot; same effect as (f): the node won't
reacquire the resources if it never rejoins the cluster...
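
To put the figure in (a) in perspective, here is the arithmetic as a short sketch. The 99.95% number is the one quoted above; the 365-day year and the function name `downtime_hours_per_year` are assumptions for illustration only.

```python
HOURS_PER_YEAR = 365 * 24  # 8760, ignoring leap years (assumption)


def downtime_hours_per_year(uptime_percent):
    """Expected hours of downtime per year for a given uptime percentage."""
    if not 0 <= uptime_percent <= 100:
        raise ValueError("uptime_percent must be between 0 and 100")
    return (100 - uptime_percent) / 100 * HOURS_PER_YEAR


if __name__ == "__main__":
    hours = downtime_hours_per_year(99.95)
    print(f"99.95% uptime is about {hours:.2f} hours of downtime per year")
```

That works out to roughly 4.38 hours a year for a single non-redundant machine, which is the baseline to weigh against the uptime hit of leaving a fenced node powered off.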

I/O fencing instead of power fencing kind of works like this: you undo
the I/O block once you know the node is fine again.

Typically, we refer to that as "fabric level fencing" vs. "power level
fencing"; both fit in with the I/O fencing paradigm by preventing a node
from flushing buffers after it has misbehaved.

Note that typically the only way to be 100% positive a node has no
buffers waiting after it has been fenced at the fabric level is a hard
reboot.

Many administrators will reboot a failed node as a first attempt to fix
it anyway - so we're just saving them a step :)  (Again, if you want,
you can always do (f) or (g) above...)

-- Lon

--

Linux-cluster@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/linux-cluster

--
Michael Will
Penguin Computing Corp.
Sales Engineer
415-954-2822
415-954-2899 fx
mwill@xxxxxxxxxxxxxxxxxxxx 
