Hello Jeff,

RE: RE: Fencing quandary

The root issue is that the ILO scripts are not up to date with the current firmware rev in the c-class and p-class blades. The method of '<device name="ilo01"/>' for a "reboot" is not working with this ILO firmware rev, and the workaround is to send 2 commands to ILO under a single method: 'action="off"/' and 'action="on"/'.

I tested this with my p-class blades and it was successful. I am still waiting for my customer's test results on their c-class blades.

...yes, this is the root issue of the ILO problem, but it does not completely address your concern. I believe you are saying that RHCS does not accept a "power off" as a fence, but requires both a "power off" and a subsequent "power on".

Regards,
James Hofmeister
Hewlett Packard Linux Solutions Engineer

|-----Original Message-----
|From: linux-cluster-bounces@xxxxxxxxxx
|[mailto:linux-cluster-bounces@xxxxxxxxxx] On Behalf Of Jeff Stoner
|Sent: Tuesday, October 14, 2008 3:43 PM
|To: linux clustering
|Subject: RE: RE: Fencing quandry
|
|Thanks for the response, James. Unfortunately, it doesn't
|fully answer my question or at least, I'm not following the
|logic. The bug report would seem to indicate a problem with
|using the default "reboot" method of the agent. The work
|around simply replaces the single fence device
|('reboot') with 2 fence devices ('off' followed by 'on') in
|the same fence method. If the server fails to power on, then,
|according to the FAQ, fencing still fails ("All fence devices
|within a fence method must succeed in order for the method to
|succeed").
|
|I'm back to fenced being a SPoF if hardware failures prevent a
|fenced node from powering on.
|
|--Jeff
|Performance Engineer
|
|OpSource, Inc.
|http://www.opsource.net
|"Your Success is Our Success"
|
|
|> -----Original Message-----
|> From: linux-cluster-bounces@xxxxxxxxxx
|> [mailto:linux-cluster-bounces@xxxxxxxxxx] On Behalf Of Hofmeister,
|> James (WTEC Linux)
|> Sent: Tuesday, October 14, 2008 1:40 PM
|> To: linux clustering
|> Subject: RE: Fencing quandry
|>
|> Hello Jeff,
|>
|> I am working with RedHat on a RHEL-5 fencing issue with c-class
|> blades... We have bugzilla 433864 opened for this and my notes state
|> to be resolved in RHEL-5.3.
|>
|> We had a workaround in the RHEL-5 cluster configuration:
|>
|> In the /etc/cluster/cluster.conf
|>
|> *Update version number by 1.
|> *Then edit the fence device section for "each" node for example:
|>
|>   <fence>
|>     <method name="1">
|>       <device name="ilo01"/>
|>     </method>
|>   </fence>
|> change to -->
|>   <fence>
|>     <method name="1">
|>       <device name="ilo01" action="off"/>
|>       <device name="ilo01" action="on"/>
|>     </method>
|>   </fence>
|>
|> Regards,
|> James Hofmeister
|> Hewlett Packard Linux Solutions Engineer
|>
|>
|>
|> |-----Original Message-----
|> |From: linux-cluster-bounces@xxxxxxxxxx
|> |[mailto:linux-cluster-bounces@xxxxxxxxxx] On Behalf Of Jeff Stoner
|> |Sent: Tuesday, October 14, 2008 8:32 AM
|> |To: linux clustering
|> |Subject: Fencing quandry
|> |
|> |We had a "that totally sucks" event the other night involving fencing.
|> |In short - Red Hat 4.7, 2 node cluster using iLO fencing with HP blade
|> |servers:
|> |
|> |- passive node determined active node was unresponsive (missed too many
|> |heartbeats)
|> |- passive node initiates take-over and begins fencing process
|> |- fencing agent successfully powers off blade server
|> |- fencing agent sits in an endless loop trying to power on the blade,
|> |which won't power up
|> |- the cluster appears "stalled" at this point because fencing won't
|> |complete
|> |
|> |I was able to complete the failover by swapping out the fencing agent
|> |with a shell script that does "exit 0".
|> |This allowed the fencing agent to complete so the resource manager could
|> |successfully relocate the service.
|> |
|> |My question becomes: why isn't a successful power off considered
|> |sufficient for a take-over of a service? If the power is off, you've
|> |guaranteed that all resources are released by that node. By requiring
|> |a successful power on (which may never happen due to hardware
|> |failure,) the fencing agent becomes a single point of failure in the
|> |cluster. The fencing agent should make an attempt to power on a down
|> |node but it shouldn't hold up the failover process if that attempt
|> |fails.
|> |
|> |--Jeff
|> |Performance Engineer
|> |
|> |OpSource, Inc.
|> |http://www.opsource.net
|> |"Your Success is Our Success"
|> |
|> |--
|> |Linux-cluster mailing list
|> |Linux-cluster@xxxxxxxxxx
|> |https://www.redhat.com/mailman/listinfo/linux-cluster