Hello Jeff,

I am working with Red Hat on a RHEL-5 fencing issue with c-class blades. We have Bugzilla 433864 open for this, and my notes state it is to be resolved in RHEL-5.3.

We had a workaround in the RHEL-5 cluster configuration. In /etc/cluster/cluster.conf:

* Update the version number by 1.
* Then edit the fence device section for each node. For example:

<fence>
  <method name="1">
    <device name="ilo01"/>
  </method>
</fence>

change to -->

<fence>
  <method name="1">
    <device name="ilo01" action="off"/>
    <device name="ilo01" action="on"/>
  </method>
</fence>

Regards,
James Hofmeister
Hewlett Packard Linux Solutions Engineer

|-----Original Message-----
|From: linux-cluster-bounces@xxxxxxxxxx
|[mailto:linux-cluster-bounces@xxxxxxxxxx] On Behalf Of Jeff Stoner
|Sent: Tuesday, October 14, 2008 8:32 AM
|To: linux clustering
|Subject: Fencing quandry
|
|We had a "that totally sucks" event the other night involving fencing.
|In short - Red Hat 4.7, 2 node cluster using iLO fencing with HP blade
|servers:
|
|- passive node determined active node was unresponsive (missed too many
|heartbeats)
|- passive node initiates take-over and begins fencing process
|- fencing agent successfully powers off blade server
|- fencing agent sits in an endless loop trying to power on the
|blade, which won't power up
|- the cluster appears "stalled" at this point because fencing
|won't complete
|
|I was able to complete the failover by swapping out the
|fencing agent with a shell script that does "exit 0". This
|allowed the fencing agent to complete so the resource manager
|could successfully relocate the service.
|
|My question becomes: why isn't a successful power off
|considered sufficient for a take-over of a service? If the
|power is off, you've guaranteed that all resources are
|released by that node. By requiring a successful power on
|(which may never happen due to hardware failure), the fencing
|agent becomes a single point of failure in the cluster. The
|fencing agent should make an attempt to power on a down node
|but it shouldn't hold up the failover process if that attempt fails.
|
|
|
|--Jeff
|Performance Engineer
|
|OpSource, Inc.
|http://www.opsource.net
|"Your Success is Our Success"
|
|
|--
|Linux-cluster mailing list
|Linux-cluster@xxxxxxxxxx
|https://www.redhat.com/mailman/listinfo/linux-cluster
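
For anyone applying the cluster.conf workaround from the reply to more than a couple of nodes, the two steps (bump the version number, then split each node's fence device into an explicit "off" entry followed by an "on" entry) could be scripted roughly as below. This is only a sketch under assumptions: the version number is taken to be the `config_version` attribute on the root `<cluster>` element, the sample document and the function name `split_fence_actions` are made up for illustration, and the script only transforms the XML text. It does not propagate the new configuration to the cluster.

```python
import xml.etree.ElementTree as ET

# Hypothetical, trimmed-down cluster.conf used only for illustration.
SAMPLE = """<cluster name="example" config_version="7">
  <clusternodes>
    <clusternode name="node1">
      <fence>
        <method name="1">
          <device name="ilo01"/>
        </method>
      </fence>
    </clusternode>
  </clusternodes>
</cluster>"""


def split_fence_actions(xml_text):
    """Return cluster.conf text with the off/on workaround applied."""
    root = ET.fromstring(xml_text)

    # Step 1: update the version number by 1.
    # Assumption: the version lives in the root element's config_version.
    root.set("config_version", str(int(root.get("config_version")) + 1))

    # Step 2: inside every <fence>/<method>, replace each <device> that has
    # no explicit action with an "off" entry followed by an "on" entry.
    for fence in list(root.iter("fence")):
        for method in fence.findall("method"):
            rewritten = []
            for child in list(method):
                if child.tag == "device" and "action" not in child.attrib:
                    for action in ("off", "on"):
                        attrs = dict(child.attrib)
                        attrs["action"] = action
                        new_dev = ET.Element("device", attrs)
                        new_dev.tail = child.tail
                        rewritten.append(new_dev)
                else:
                    # Leave devices that already name an action untouched.
                    rewritten.append(child)
            for child in list(method):
                method.remove(child)
            method.extend(rewritten)

    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    print(split_fence_actions(SAMPLE))
```

Devices that already carry an `action` attribute are left alone, so running the script a second time bumps the version again but does not duplicate the device entries.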