Janne Peltonen wrote:
> Hi!
>
> I started wondering what happens if my fence device is broken. The
> scenario:
> - a node (running a service) fails
> - another node notices the lost heartbeats and tries to fence the
>   failed node
> - however, the fence device doesn't respond
> - ...what now?
>
> I tried to simulate the situation with our test cluster of two HP
> Blade servers, using iLO fencing, by misconfiguring the fencing agent
> to use a wrong username to authenticate to the iLO. What happens is
> that the fenced daemon on the running node tries to fence the failed
> node over and over again, and the service I'm trying to fail over
> never leaves state "Started" on node "Unknown"... that is, the
> cluster never fails it over to the running node.
>
> Not good.
Actually, it is good. A node failure comes in many shapes and sizes,
from a full system failure (where the whole machine is powered off) to
a partial failure (where only the NIC used for heartbeats has failed,
but not the OS or the disk controllers). If only the NIC fails, your
service is still running, still updating the hard drive, and still
generally operating correctly, but it's no longer able to send
heartbeats.
Now, if the other system tries to take over the service on the
assumption that the failed node is offline, it will mount the drive
and start the service, and since two systems then have the same
non-clustered filesystem mounted read-write, they will corrupt it
pretty quickly. That is exactly what fencing is designed to prevent.
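
As a side note, you can usually reproduce the authentication problem
outside the cluster by running the fence agent by hand, which makes it
obvious whether the trouble is the iLO credentials or something else.
A rough sketch, assuming the stock fence_ilo agent from the fence
package (flag names can differ between versions, so check
fence_ilo -h and the man page first):

    # Ask the iLO for the node's power status using the same
    # credentials the cluster is configured with.  With a wrong
    # username this should fail with an authentication error
    # instead of printing the power state.
    fence_ilo -a <ilo-address-of-failed-node> -l <username> -p <password> -o status

If this command can't authenticate, fenced will keep failing for the
same reason, no matter how many times it retries.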
So to keep that scenario from happening, the cluster software ensures
that a successful fence occurs before continuing operation. It's a
fail-safe style setup: better to take 30 minutes of downtime for an
admin to make the right decision than to corrupt your filesystems and
have to take 8-24 hours of downtime to restore the system.
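
When that happens, the admin basically has two options: repair the
fence device configuration (in your test, fix the iLO username) so the
pending fence can finally succeed, or, after physically verifying that
the node really is down, acknowledge the fence by hand so fenced stops
retrying and recovery can continue. A sketch of the latter, assuming
the fence_ack_manual tool shipped with the cluster suite (its exact
syntax differs between releases, so see fence_ack_manual(8)):

    # Only run this after confirming out of band that the failed node
    # is genuinely powered off; otherwise you reintroduce the
    # corruption risk described above.
    fence_ack_manual -n <name-of-failed-node>

Either way, the cluster won't move the service until it believes the
failed node can no longer touch the shared storage.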
Thanks,
Eric Kerin
eric@xxxxxxxxxxx
--
Linux-cluster mailing list
Linux-cluster@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/linux-cluster