We had a "that totally sucks" event the other night involving fencing. In short: Red Hat 4.7, 2-node cluster using iLO fencing with HP blade servers:

- the passive node determined the active node was unresponsive (missed too many heartbeats)
- the passive node initiated a take-over and began the fencing process
- the fencing agent successfully powered off the blade server
- the fencing agent sat in an endless loop trying to power on the blade, which wouldn't power up
- the cluster appeared "stalled" at this point because fencing wouldn't complete

I was able to complete the failover by swapping out the fencing agent with a shell script that does "exit 0". This allowed the fencing step to complete so the resource manager could successfully relocate the service.

My question becomes: why isn't a successful power off considered sufficient for a take-over of a service? If the power is off, you've guaranteed that all resources are released by that node. By requiring a successful power on (which may never happen due to hardware failure), the fencing agent becomes a single point of failure in the cluster. The fencing agent should make an attempt to power on a down node, but it shouldn't hold up the failover process if that attempt fails.

--Jeff
Performance Engineer
OpSource, Inc.
http://www.opsource.net
"Your Success is Our Success"

--
Linux-cluster mailing list
Linux-cluster@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/linux-cluster
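
To make the proposal in the message above concrete, here is a minimal Python sketch of the behavior being asked for: report fencing as successful once power-off is confirmed, and treat power-on as a bounded, best-effort step. This is not how the shipped iLO fencing agent is implemented. The function name `fence_node`, its parameters, and the three callables are hypothetical stand-ins for whatever actually talks to the blade's management interface.

```python
import time
from typing import Callable


def fence_node(
    power_off: Callable[[], None],       # hypothetical: ask the management board to cut power
    is_powered_off: Callable[[], bool],  # hypothetical: query the current power state
    power_on: Callable[[], bool],        # hypothetical: returns True if the node came back up
    power_on_attempts: int = 3,
    retry_delay: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Return 0 once power-off is confirmed; power-on is best effort only."""
    try:
        power_off()
    except Exception as exc:
        print(f"fence: power-off request failed: {exc}")
        return 1

    # Fencing is only safe to report as done if the node is verifiably off.
    if not is_powered_off():
        print("fence: could not confirm power-off, reporting failure")
        return 1

    # From here on the node holds no resources, so nothing below may
    # change the exit status or block indefinitely.
    for attempt in range(1, power_on_attempts + 1):
        try:
            if power_on():
                break
        except Exception as exc:
            print(f"fence: power-on attempt {attempt} raised: {exc}")
        if attempt < power_on_attempts:
            sleep(retry_delay)
    else:
        # Loop finished without a successful power-on.
        print("fence: node stayed off after retries, needs manual attention")

    return 0
```

With this shape, a blade that refuses to power up costs at most `power_on_attempts` retries before the exit status 0 lets the resource manager relocate the service, while a node whose power-off cannot be confirmed still blocks the failover.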