RE: Fencing quandry

"Hofmeister, James (WTEC Linux)" <james.hofmeister@xxxxxx> · Tue, 14 Oct 2008 17:39:48 +0000

Hello Jeff,

I am working with Red Hat on a RHEL-5 fencing issue with c-Class blades. We have Bugzilla 433864 open for this, and my notes state that it is to be resolved in RHEL-5.3.

We had a workaround in the RHEL-5 cluster configuration:

  In /etc/cluster/cluster.conf:

  * Increment the version number (the config_version attribute) by 1.
  * Then edit the fence section for each node, for example:

                        <fence>
                                <method name="1">
                                        <device name="ilo01"/>
                                </method>
                        </fence>
  change to  -->
                        <fence>
                                <method name="1">
                                        <device name="ilo01" action="off"/>
                                        <device name="ilo01" action="on"/>
                                </method>
                        </fence>
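
For context, here is a minimal sketch of where both changes sit in a complete file. The cluster name, node name, version numbers, iLO hostname and credentials are placeholders made up for illustration; they are not values from a real cluster, so substitute your own:

```xml
<?xml version="1.0"?>
<!-- Hypothetical example: all names, addresses and credentials are placeholders. -->
<!-- config_version was 4 before the edit (assumed) and is bumped to 5. -->
<cluster name="examplecluster" config_version="5">
  <clusternodes>
    <clusternode name="node01.example.com" nodeid="1" votes="1">
      <fence>
        <method name="1">
          <!-- explicit power off first, then a separate power on -->
          <device name="ilo01" action="off"/>
          <device name="ilo01" action="on"/>
        </method>
      </fence>
    </clusternode>
    <!-- repeat the same change for every other clusternode -->
  </clusternodes>
  <fencedevices>
    <!-- name must match the device name used in the node's fence section -->
    <fencedevice name="ilo01" agent="fence_ilo" hostname="ilo01.example.com" login="admin" passwd="changeme"/>
  </fencedevices>
</cluster>
```

The device entries within a method are run in the order listed, so the power off is issued before the power on.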

Regards,
James Hofmeister
Hewlett Packard Linux Solutions Engineer

|-----Original Message-----
|From: linux-cluster-bounces@xxxxxxxxxx
|[mailto:linux-cluster-bounces@xxxxxxxxxx] On Behalf Of Jeff Stoner
|Sent: Tuesday, October 14, 2008 8:32 AM
|To: linux clustering
|Subject:  Fencing quandry
|
|We had a "that totally sucks" event the other night involving fencing.
|In short - Red Hat 4.7, 2 node cluster using iLO fencing with HP blade
|servers:
|
|- passive node determined active node was unresponsive (missed too many
|heartbeats)
|- passive node initiates take-over and begins fencing process
|- fencing agent successfully powers off blade server
|- fencing agent sits in an endless loop trying to power on the
|blade, which won't power up
|- the cluster appears "stalled" at this point because fencing
|won't complete
|
|I was able to complete the failover by swapping out the
|fencing agent with a shell script that does "exit 0". This
|allowed the fencing agent to complete so the resource manager
|could successfully relocate the service.
|
|My question becomes: why isn't a successful power off
|considered sufficient for a take-over of a service? If the
|power is off, you've guaranteed that all resources are
|released by that node. By requiring a successful power on
|(which may never happen due to hardware failure), the fencing
|agent becomes a single point of failure in the cluster. The
|fencing agent should make an attempt to power on a down node
|but it shouldn't hold up the failover process if that attempt fails.
|
|
|
|--Jeff
|Performance Engineer
|
|OpSource, Inc.
|http://www.opsource.net
|"Your Success is Our Success"
|
|
|--
|Linux-cluster mailing list
|Linux-cluster@xxxxxxxxxx
|https://www.redhat.com/mailman/listinfo/linux-cluster
|
