RE: RE: Fencing quandary


Hello Jeff,

RE:  RE: Fencing quandary

The root issue is that the iLO scripts are not up to date with the current firmware revision on the c-class and p-class blades.

The '<device name="ilo01"/>' method, which performs a "reboot" by default, does not work with this iLO firmware revision. The workaround is to send two commands to iLO under a single method: 'action="off"' followed by 'action="on"'.

I have tested this on my p-class blades and it was successful.  I am still waiting for my customer's test results on their c-class blades.

Yes, this is the root cause of the iLO problem, but it does not completely address your concern.  I believe you are saying that RHCS does not accept a "power off" alone as a successful fence, but requires a "power off" followed by a "power on".
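
To check that I have your concern right, here is a minimal Python sketch of the FAQ rule you quoted ("All fence devices within a fence method must succeed in order for the method to succeed"). It is only a toy model under stated assumptions: run_method, power_off and power_on_broken are made-up names and are not part of fenced or the iLO agent, and the failed power-on is modelled as a plain failure, whereas in your event the agent kept retrying.

```python
# Toy model of the FAQ rule: every device in a fence method must succeed
# for the method to succeed. All names here are hypothetical.
from typing import Callable, List

Device = Callable[[], bool]  # returns True when the device action succeeds


def run_method(devices: List[Device]) -> bool:
    """Run the devices in order; fail as soon as one of them fails."""
    for device in devices:
        if not device():
            return False
    return True


def power_off() -> bool:
    return True   # assumption: the blade powers off as requested


def power_on_broken() -> bool:
    return False  # assumption: a hardware fault keeps the blade off


# The workaround layout: 'off' followed by 'on' in the same method.
off_then_on = run_method([power_off, power_on_broken])

# A method containing only the 'off' step, for comparison.
off_only = run_method([power_off])

print("off + on:", off_then_on)
print("off only:", off_only)
```

In this model the off-plus-on method reports failure even though the node is already powered off, which is the situation you describe.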

Regards,
James Hofmeister
Hewlett Packard Linux Solutions Engineer

|-----Original Message-----
|From: linux-cluster-bounces@xxxxxxxxxx
|[mailto:linux-cluster-bounces@xxxxxxxxxx] On Behalf Of Jeff Stoner
|Sent: Tuesday, October 14, 2008 3:43 PM
|To: linux clustering
|Subject: RE:  RE: Fencing quandry
|
|Thanks for the response, James. Unfortunately, it doesn't
|fully answer my question or at least, I'm not following the
|logic. The bug report would seem to indicate a problem with
|using the default "reboot" method of the agent. The workaround
|simply replaces the single fence device ('reboot') with 2 fence
|devices ('off' followed by 'on') in
|the same fence method. If the server fails to power on, then,
|according to the FAQ, fencing still fails ("All fence devices
|within a fence method must succeed in order for the method to
|succeed").
|
|I'm back to fenced being a SPoF if hardware failures prevent a
|fenced node from powering on.
|
|--Jeff
|Performance Engineer
|
|OpSource, Inc.
|http://www.opsource.net
|"Your Success is Our Success"
|
|
|> -----Original Message-----
|> From: linux-cluster-bounces@xxxxxxxxxx
|> [mailto:linux-cluster-bounces@xxxxxxxxxx] On Behalf Of Hofmeister,
|> James (WTEC Linux)
|> Sent: Tuesday, October 14, 2008 1:40 PM
|> To: linux clustering
|> Subject:  RE: Fencing quandry
|>
|> Hello Jeff,
|>
|> I am working with Red Hat on a RHEL-5 fencing issue with c-class
|> blades...  We have bugzilla 433864 opened for this and my notes state
|> it is to be resolved in RHEL-5.3.
|>
|> We had a workaround in the RHEL-5 cluster configuration:
|>
|>   In the /etc/cluster/cluster.conf
|>
|>   *Update version number by 1.
|>   *Then edit the fence device section for "each" node for example:
|>
|>                         <fence>
|>                                 <method name="1">
|>                                         <device name="ilo01"/>
|>                                 </method>
|>                         </fence>
|>   change to  -->
|>                         <fence>
|>                                 <method name="1">
|>                                         <device name="ilo01" action="off"/>
|>                                         <device name="ilo01" action="on"/>
|>                                 </method>
|>                         </fence>
|>
|> Regards,
|> James Hofmeister
|> Hewlett Packard Linux Solutions Engineer
|>
|>
|>
|> |-----Original Message-----
|> |From: linux-cluster-bounces@xxxxxxxxxx
|> |[mailto:linux-cluster-bounces@xxxxxxxxxx] On Behalf Of Jeff Stoner
|> |Sent: Tuesday, October 14, 2008 8:32 AM
|> |To: linux clustering
|> |Subject:  Fencing quandry
|> |
|> |We had a "that totally sucks" event the other night involving fencing.
|> |In short - Red Hat 4.7, 2 node cluster using iLO fencing with HP blade
|> |servers:
|> |
|> |- passive node determined active node was unresponsive (missed too many
|> |heartbeats)
|> |- passive node initiates take-over and begins fencing process
|> |- fencing agent successfully powers off blade server
|> |- fencing agent sits in an endless loop trying to power on the blade,
|> |which won't power up
|> |- the cluster appears "stalled" at this point because fencing won't
|> |complete
|> |
|> |I was able to complete the failover by swapping out the fencing agent
|> |with a shell script that does "exit 0". This allowed the fencing
|> |agent to complete so the resource manager could successfully relocate
|> |the service.
|> |
|> |My question becomes: why isn't a successful power off considered
|> |sufficient for a take-over of a service? If the power is off, you've
|> |guaranteed that all resources are released by that node. By requiring
|> |a successful power on (which may never happen due to hardware
|> |failure), the fencing agent becomes a single point of failure in the
|> |cluster. The fencing agent should make an attempt to power on a down
|> |node but it shouldn't hold up the failover process if that attempt
|> |fails.
|> |
|> |
|> |
|> |--Jeff
|> |Performance Engineer
|> |
|> |OpSource, Inc.
|> |http://www.opsource.net
|> |"Your Success is Our Success"
|> |
|> |
|> |--
|> |Linux-cluster mailing list
|> |Linux-cluster@xxxxxxxxxx
|> |https://www.redhat.com/mailman/listinfo/linux-cluster
|> |
|>
|>
|>
|
|

