fencing problem

Hello,
I'm experiencing some problems with cluster fencing.
First, let's start with the specs:

It's a two-node cluster (Sun X4100) running RHEL4 Update 4 and RHCS 4.

Both machines have an ILOM device that acts as the first level of fencing.
A second level of fencing is performed by a UPS.

My problem is the following:
If I shut down one of the nodes (simulating a power failure), the other node tries to fence the failed node. So far so good. The problem is that, since the ILOM in the failed node is offline, the surviving node keeps retrying the ILOM fence method and never gives up!

According to what I've read in the FAQ about fencing levels, if the first level fails, fencing should move on to the second level, and so on.

But it never does this!
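
To make the expectation concrete, here is a minimal Python sketch of the fallback behaviour I understood from the FAQ. It is only a model of what I expected, not how fenced is actually implemented; the function names (`fence_with_fallback`, `ilom_offline`, `ups_power_cycle`) are made up for illustration.

```python
from typing import Callable, Sequence

# Hypothetical model: each fence method is a callable returning True on success.
FenceMethod = Callable[[], bool]


def fence_with_fallback(methods: Sequence[FenceMethod]) -> int:
    """Try each method in order; return the index of the first success, or -1."""
    for level, method in enumerate(methods):
        if method():
            return level
    return -1


def ilom_offline() -> bool:
    # Stands in for fence_ipmilan failing because the ILOM has no power.
    return False


def ups_power_cycle() -> bool:
    # Stands in for the second-level (UPS) method succeeding.
    return True


if __name__ == "__main__":
    # Expected: the first level fails, so the second level (index 1) is used.
    print(fence_with_fallback([ilom_offline, ups_power_cycle]))
```

What I actually see is as if only the first method were ever retried.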

Here is the relevant excerpt from /var/log/messages:

Dec 11 17:50:28 node_b kernel: CMAN: removing node node_a from the cluster : Missed too many heartbeats
Dec 11 17:50:28 node_b fenced[3240]: node_a not a cluster member after 0 sec post_fail_delay
Dec 11 17:50:28 node_b fenced[3240]: fencing node "node_a"
Dec 11 17:52:47 node_b fenced[3240]: agent "fence_ipmilan" reports: Rebooting machine @ IPMI:172.18.56.17...ipmilan: Failed to connect after 30 seconds Failed
Dec 11 17:52:47 node_b ccsd[9390]: process_get: Invalid connection descriptor received.
Dec 11 17:52:47 node_b ccsd[9390]: Error while processing get: Invalid request descriptor
Dec 11 17:52:47 node_b fenced[3240]: fence "node_a" failed
Dec 11 17:52:52 node_b fenced[3240]: fencing node "node_a"

The fence attempt and the failure messages that follow it then repeat forever.

Here is a copy of the cluster.conf:


<?xml version="1.0"?>
<cluster config_version="19" name="SERVER-A">
       <fence_daemon post_fail_delay="0" post_join_delay="3"/>
       <clusternodes>
               <clusternode name="node-a" votes="1">
                       <fence>
                               <method name="1">
                                       <device name="fence_node-a"/>
                               </method>
                               <method name="2">
                                       <device name="UPS_node-a"/>
                               </method>
                       </fence>
               </clusternode>
               <clusternode name="node-b" votes="1">
                       <fence>
                               <method name="1">
                                       <device name="fence_node-b"/>
                               </method>
                               <method name="2">
                                       <device name="UPS_node-b"/>
                               </method>
                       </fence>
               </clusternode>
       </clusternodes>
       <cman expected_votes="1" two_node="1"/>
       <fencedevices>
               <fencedevice agent="fence_ipmilan" auth="password" ipaddr="172.18.57.17" login="root" name="fence_node-a" passwd="changeme"/>
               <fencedevice agent="fence_ipmilan" auth="password" ipaddr="172.18.57.18" login="root" name="fence_node-b" passwd="changeme"/>
               <fencedevice agent="fence_apc" ipaddr="172.18.57.20" login="power" name="UPS_node-a" passwd="power"/>
               <fencedevice agent="fence_apc" ipaddr="172.18.57.21" login="power" name="UPS_node-b" passwd="power"/>
       </fencedevices>
       <rm>
               <failoverdomains>
                       <failoverdomain name="Cluster_0" ordered="1" restricted="0">
                               <failoverdomainnode name="node-a" priority="1"/>
                               <failoverdomainnode name="node-b" priority="1"/>
                       </failoverdomain>
               </failoverdomains>
               <resources>
                       <fs device="/dev/sdb1" force_fsck="1" force_unmount="1" fsid="46144" fstype="ext3" mountpoint="/mnt/shared" name="Storedge_Shared" options="" self_fence="1"/>
                       <ip address="172.18.57.16" monitor_link="1"/>
                       <ip address="172.18.57.11" monitor_link="1"/>
                       <ip address="172.18.57.14" monitor_link="1"/>
               </resources>
               <service autostart="1" domain="Cluster_0" name="postgresql">
                       <ip ref="172.18.57.16">
                               <fs ref="Storedge_Shared">
                                       <script file="/etc/init.d/postgresql" name="PostgreSQL"/>
                               </fs>
                       </ip>
               </service>
               <service autostart="1" domain="Cluster_0" name="afs">
                       <ip ref="172.18.57.14">
                               <script file="/etc/init.d/afs" name="AFS"/>
                       </ip>
               </service>
       </rm>
</cluster>
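
As a sanity check that both fence methods are defined and in the order I expect, the per-node fencing levels can be listed with a small Python script. This is only a sketch: the `fence_order` helper is my own name, and the embedded `CONF` string is a trimmed copy of the file above (the real file would live wherever your cluster.conf is installed).

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the cluster.conf above, kept inline so the sketch is self-contained.
CONF = """<cluster config_version="19" name="SERVER-A">
  <clusternodes>
    <clusternode name="node-a" votes="1">
      <fence>
        <method name="1"><device name="fence_node-a"/></method>
        <method name="2"><device name="UPS_node-a"/></method>
      </fence>
    </clusternode>
    <clusternode name="node-b" votes="1">
      <fence>
        <method name="1"><device name="fence_node-b"/></method>
        <method name="2"><device name="UPS_node-b"/></method>
      </fence>
    </clusternode>
  </clusternodes>
  <fencedevices>
    <fencedevice agent="fence_ipmilan" name="fence_node-a"/>
    <fencedevice agent="fence_ipmilan" name="fence_node-b"/>
    <fencedevice agent="fence_apc" name="UPS_node-a"/>
    <fencedevice agent="fence_apc" name="UPS_node-b"/>
  </fencedevices>
</cluster>"""


def fence_order(root):
    """Map each node name to a list of (method name, device name, agent) tuples.

    The agent is None when a method refers to a device that is not
    declared under <fencedevices>.
    """
    agents = {d.get("name"): d.get("agent") for d in root.iter("fencedevice")}
    order = {}
    for node in root.iter("clusternode"):
        levels = []
        for method in node.findall("./fence/method"):
            for dev in method.findall("device"):
                name = dev.get("name")
                levels.append((method.get("name"), name, agents.get(name)))
        order[node.get("name")] = levels
    return order


if __name__ == "__main__":
    for node, levels in fence_order(ET.fromstring(CONF)).items():
        print(node, levels)
```

For both nodes this lists fence_ipmilan as method 1 and fence_apc as method 2, which is the order I intended.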

I would like to know how to solve this problem. :-)

Thanks in advance,

Marcos David



