Thanks for the response, James. Unfortunately, it doesn't fully answer my
question, or at least I'm not following the logic.

The bug report seems to indicate a problem with using the agent's default
"reboot" method. The workaround simply replaces the single fence device
('reboot') with two fence devices ('off' followed by 'on') in the same
fence method. If the server fails to power on, then, according to the FAQ,
fencing still fails ("All fence devices within a fence method must succeed
in order for the method to succeed"). I'm back to fenced being a single
point of failure if hardware failures prevent a fenced node from powering
on.
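To put that in concrete terms: as I understand the semantics, the off+on
workaround boils down to the sequence below. This is only a sketch, not
the actual fenced logic; the iLO hostname and credentials are invented,
and I'm assuming fence_ilo's usual -a/-l/-p/-o command-line options.

  #!/bin/sh
  # Sketch of the off+on workaround's semantics: a fence method
  # succeeds only if every device in it succeeds, so this is an AND.
  fence_ilo -a ilo01.example.com -l Administrator -p secret -o off || exit 1
  fence_ilo -a ilo01.example.com -l Administrator -p secret -o on || exit 1
  # If failed hardware keeps the blade from powering on, the second
  # call fails, the whole method fails, and fenced retries forever:
  # the same stall we hit with the default "reboot" action.
  exit 0

Only the "off" step matters for safety; once it succeeds, the node cannot
be holding any cluster resources, so gating the failover on the "on" step
buys us nothing.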
--Jeff
Performance Engineer

OpSource, Inc.
http://www.opsource.net
"Your Success is Our Success"

> -----Original Message-----
> From: linux-cluster-bounces@xxxxxxxxxx
> [mailto:linux-cluster-bounces@xxxxxxxxxx] On Behalf Of
> Hofmeister, James (WTEC Linux)
> Sent: Tuesday, October 14, 2008 1:40 PM
> To: linux clustering
> Subject: RE: Fencing quandry
>
> Hello Jeff,
>
> I am working with Red Hat on a RHEL-5 fencing issue with
> c-class blades... We have bugzilla 433864 opened for this,
> and my notes state it is to be resolved in RHEL-5.3.
>
> We had a workaround in the RHEL-5 cluster configuration:
>
> In /etc/cluster/cluster.conf:
>
> * Update the version number by 1.
> * Then edit the fence device section for each node, for example:
>
>   <fence>
>     <method name="1">
>       <device name="ilo01"/>
>     </method>
>   </fence>
>
> change to -->
>
>   <fence>
>     <method name="1">
>       <device name="ilo01" action="off"/>
>       <device name="ilo01" action="on"/>
>     </method>
>   </fence>
>
> Regards,
> James Hofmeister
> Hewlett Packard Linux Solutions Engineer
>
> |-----Original Message-----
> |From: linux-cluster-bounces@xxxxxxxxxx
> |[mailto:linux-cluster-bounces@xxxxxxxxxx] On Behalf Of Jeff Stoner
> |Sent: Tuesday, October 14, 2008 8:32 AM
> |To: linux clustering
> |Subject: Fencing quandry
> |
> |We had a "that totally sucks" event the other night involving fencing.
> |In short: Red Hat 4.7, 2-node cluster using iLO fencing with HP blade
> |servers:
> |
> |- passive node determined active node was unresponsive (missed
> |too many heartbeats)
> |- passive node initiates take-over and begins fencing process
> |- fencing agent successfully powers off blade server
> |- fencing agent sits in an endless loop trying to power on the
> |blade, which won't power up
> |- the cluster appears "stalled" at this point because fencing
> |won't complete
> |
> |I was able to complete the failover by swapping out the
> |fencing agent with a shell script that does "exit 0". This
> |allowed the fencing agent to complete so the resource manager
> |could successfully relocate the service.
> |
> |My question becomes: why isn't a successful power-off
> |considered sufficient for a take-over of a service? If the
> |power is off, you've guaranteed that all resources are
> |released by that node. By requiring a successful power-on
> |(which may never happen due to hardware failure), the fencing
> |agent becomes a single point of failure in the cluster. The
> |fencing agent should make an attempt to power on a down node,
> |but it shouldn't hold up the failover process if that attempt fails.
> |
> |--Jeff
> |Performance Engineer
> |
> |OpSource, Inc.
> |http://www.opsource.net
> |"Your Success is Our Success"

--
Linux-cluster mailing list
Linux-cluster@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/linux-cluster