Fencing quandary

We had a "that totally sucks" event the other night involving fencing.
In short: Red Hat 4.7, two-node cluster using iLO fencing with HP blade
servers. The sequence of events:

- passive node determines active node is unresponsive (missed too many
heartbeats)
- passive node initiates take-over and begins fencing process
- fencing agent successfully powers off blade server
- fencing agent sits in an endless loop trying to power on the blade,
which won't power up
- the cluster appears "stalled" at this point because fencing won't
complete

I was able to complete the failover by swapping out the fencing agent
with a shell script that does "exit 0". This allowed the fencing agent
to complete so the resource manager could successfully relocate the
service.

My question becomes: why isn't a successful power off considered
sufficient for a take-over of a service? If the power is off, you've
guaranteed that all resources are released by that node. By requiring a
successful power on (which may never happen due to hardware failure),
the fencing agent becomes a single point of failure in the cluster. The
fencing agent should make an attempt to power on a down node, but it
shouldn't hold up the failover process if that attempt fails.
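
To make the proposed behavior concrete, here is a rough Python sketch of
the logic. It is not the actual iLO fencing agent: fence_node, the three
callables, and max_power_on_attempts are all hypothetical names made up
for illustration. The only thing taken from the situation above is that
an exit status of 0 is what lets fencing complete.

```python
import logging
from typing import Callable


def fence_node(
    power_off: Callable[[], bool],
    is_powered_off: Callable[[], bool],
    power_on: Callable[[], bool],
    max_power_on_attempts: int = 3,
) -> int:
    """Return 0 once power-off is confirmed; power-on is best effort only.

    All three callables are hypothetical stand-ins for whatever actually
    talks to the management controller.
    """
    # Fencing proper: the node must be verifiably off, otherwise it may
    # still hold cluster resources and failover is not safe.
    if not power_off() or not is_powered_off():
        return 1

    # Best effort: try a bounded number of times to bring the node back.
    # A failure here is logged but never blocks the failover.
    for attempt in range(1, max_power_on_attempts + 1):
        if power_on():
            break
        logging.warning("power-on attempt %d failed", attempt)

    return 0


class DeadBlade:
    """Simulated blade that powers off fine but never powers back on."""

    def __init__(self) -> None:
        self.powered = True
        self.power_on_calls = 0

    def power_off(self) -> bool:
        self.powered = False
        return True

    def is_powered_off(self) -> bool:
        return not self.powered

    def power_on(self) -> bool:
        self.power_on_calls += 1
        return False


if __name__ == "__main__":
    blade = DeadBlade()
    status = fence_node(blade.power_off, blade.is_powered_off, blade.power_on)
    print("fence status:", status, "power-on attempts:", blade.power_on_calls)
```

The point of the sketch is the ordering: a confirmed power off decides
the return status, and the power-on loop is bounded so a blade that
never comes back cannot stall the cluster. If power off cannot be
confirmed, the function still reports failure, since the node might
still be holding resources.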



--Jeff
Performance Engineer

OpSource, Inc.
http://www.opsource.net
"Your Success is Our Success"
 
