Hi,
My name is Jon Swift, and for the past few months I have been trying, on and off, to get NFS v3 on top of GFS1 and RHEL 5.1 to work dependably. I have a 3-node cluster of Dell 1850s, each using 2 HBAs to connect to an EMC CX3-40 SAN. I use PowerPath 5.1 to manage the multiple paths to the SAN, and I have tried both Dell DRAC and IPMI fencing; the cluster is currently configured for IPMI fencing. My number one dependability issue is that after rebooting one node in the cluster, I have problems with the remaining nodes. I use the following steps before rebooting the node:
/etc/init.d/rgmanager stop
exportfs -ua
/etc/init.d/nfs stop
/etc/init.d/gfs stop
/etc/init.d/clvmd stop
/sbin/fence_tool leave
sync; sync
sleep 5
/etc/init.d/cman stop
Reboot
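For reference, the sequence above can be collected into one small script. This is only a sketch: the init-script paths and commands match a stock RHEL 5.1 node, and the DRY_RUN guard is my own addition (on by default) so the ordering can be reviewed without touching a live cluster.

```shell
#!/bin/sh
# Sketch of the node-shutdown sequence from the post, for a RHEL 5.1
# cluster node. DRY_RUN=1 (the default here) prints each command
# instead of executing it, so the ordering can be checked safely.
: "${DRY_RUN:=1}"

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

shutdown_sequence() {
    run /etc/init.d/rgmanager stop   # stop rgmanager-controlled services first
    run exportfs -ua                 # withdraw all NFS exports
    run /etc/init.d/nfs stop
    run /etc/init.d/gfs stop         # unmounts the GFS file systems
    run /etc/init.d/clvmd stop
    run /sbin/fence_tool leave       # leave the fence domain cleanly
    run sync
    run sleep 5
    run /etc/init.d/cman stop        # plain "stop" -- no extra argument
    run reboot
}

# On a real node, run with: DRY_RUN=0 sh shutdown-node.sh
shutdown_sequence
```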
Then, just after the reboot step completes on the rebooted node, CMAN stops on the other nodes in the cluster and, most of the time, GFS hangs on them. The following entries from the /var/log/messages file are typical:
Apr 17 15:07:46 cnfs03 openais[5179]: [TOTEM] entering GATHER state from 12.
Apr 17 15:07:51 cnfs03 openais[5179]: [TOTEM] entering GATHER state from 11.
Apr 17 15:07:51 cnfs03 openais[5179]: [TOTEM] Saving state aru 1016 high seq received 1016
Apr 17 15:07:51 cnfs03 openais[5179]: [TOTEM] Storing new sequence id for ring 6ffe0
Apr 17 15:07:51 cnfs03 openais[5179]: [TOTEM] entering COMMIT state.
Apr 17 15:07:51 cnfs03 openais[5179]: [TOTEM] entering RECOVERY state.
Apr 17 15:07:51 cnfs03 openais[5179]: [TOTEM] position [0] member 192.168.15.2:
Apr 17 15:07:51 cnfs03 openais[5179]: [TOTEM] previous ring seq 458716 rep 192.168.15.1
Apr 17 15:07:51 cnfs03 openais[5179]: [TOTEM] aru 1016 high delivered 1016 received flag 1
Apr 17 15:07:51 cnfs03 openais[5179]: [TOTEM] position [1] member 192.168.15.3:
Apr 17 15:07:51 cnfs03 openais[5179]: [TOTEM] previous ring seq 458716 rep 192.168.15.1
Apr 17 15:07:51 cnfs03 openais[5179]: [TOTEM] aru 1016 high delivered 1016 received flag 1
Apr 17 15:07:51 cnfs03 openais[5179]: [TOTEM] Did not need to originate any messages in recovery.
Apr 17 15:07:51 cnfs03 kernel: dlm: closing connection to node 1
Apr 17 15:07:51 cnfs03 openais[5179]: [CLM ] CLM CONFIGURATION CHANGE
Apr 17 15:07:51 cnfs03 openais[5179]: [CLM ] New Configuration:
Apr 17 15:07:51 cnfs03 openais[5179]: [CLM ] r(0) ip(192.168.15.2)
Apr 17 15:07:51 cnfs03 openais[5179]: [CLM ] r(0) ip(192.168.15.3)
Apr 17 15:07:51 cnfs03 openais[5179]: [CLM ] Members Left:
Apr 17 15:07:51 cnfs03 openais[5179]: [CLM ] r(0) ip(192.168.15.1)
Apr 17 15:07:51 cnfs03 openais[5179]: [CLM ] Members Joined:
Apr 17 15:07:51 cnfs03 openais[5179]: [CLM ] CLM CONFIGURATION CHANGE
Apr 17 15:07:51 cnfs03 openais[5179]: [CLM ] New Configuration:
Apr 17 15:07:51 cnfs03 openais[5179]: [CLM ] r(0) ip(192.168.15.2)
Apr 17 15:07:51 cnfs03 openais[5179]: [CLM ] r(0) ip(192.168.15.3)
Apr 17 15:07:51 cnfs03 openais[5179]: [CLM ] Members Left:
Apr 17 15:07:51 cnfs03 openais[5179]: [CLM ] Members Joined:
Apr 17 15:07:51 cnfs03 openais[5179]: [SYNC ] This node is within the primary component and will provide service.
Apr 17 15:07:51 cnfs03 openais[5179]: [TOTEM] entering OPERATIONAL state.
Apr 17 15:07:51 cnfs03 openais[5179]: [CLM ] got nodejoin message 192.168.15.2
Apr 17 15:07:51 cnfs03 openais[5179]: [CLM ] got nodejoin message 192.168.15.3
Apr 17 15:07:57 cnfs03 gfs_controld[5207]: cluster is down, exiting
Apr 17 15:07:57 cnfs03 dlm_controld[5201]: groupd is down, exiting
Apr 17 15:07:57 cnfs03 kernel: dlm: connecting to 2
Apr 17 15:07:57 cnfs03 fenced[5195]: cluster is down, exiting
Apr 17 15:08:01 cnfs03 kernel: dlm: closing connection to node 3
Apr 17 15:08:10 cnfs03 kernel: dlm: closing connection to node 2
Apr 17 15:08:23 cnfs03 ccsd[5159]: Unable to connect to cluster infrastructure after 30 seconds.
When the rebooted node comes back up it starts CMAN, but it does not start clvmd and does not join the fence domain. And of course none of the GFS file systems are mounted.
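One way to see what the surviving nodes think the membership is after the reboot is to filter the output of `cman_tool nodes`. The sample output below is illustrative only (node names taken from this cluster, Inc/Joined values invented); on a real node the text would come from running the command itself.

```shell
#!/bin/sh
# Illustrative sample of `cman_tool nodes` output on RHEL 5.1; status
# "M" means current member, "X" means the node is not a member.
sample='Node  Sts   Inc   Joined               Name
   1   X   458716                        ic-cnfs01
   2   M   458712   2008-04-17 14:55:01  ic-cnfs02
   3   M   458712   2008-04-17 14:55:01  ic-cnfs03'

# Print the names of nodes that are current members (status "M"),
# skipping the header line.
printf '%s\n' "$sample" | awk 'NR > 1 && $2 == "M" { print $NF }'
```

On a live node, `group_tool ls` similarly shows whether the fence, DLM, and GFS groups are in a clean state or stuck in recovery.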
Below is a copy of the /etc/cluster/cluster.conf file.
<?xml version="1.0"?>
<cluster alias="cnfs_cluster" config_version="40" name="cnfs">
<fence_daemon clean_start="0" post_fail_delay="2" post_join_delay="20"/>
<clusternodes>
<clusternode name="ic-cnfs01" nodeid="1" votes="1">
<fence>
<method name="1">
<device name="IPMI_LAN_CNFS01"/>
</method>
</fence>
</clusternode>
<clusternode name="ic-cnfs02" nodeid="2" votes="1">
<fence>
<method name="1">
<device lanplus="" name="IPMI_LAN_CNFS02"/>
</method>
</fence>
</clusternode>
<clusternode name="ic-cnfs03" nodeid="3" votes="1">
<fence>
<method name="1">
<device lanplus="" name="IPMI_LAN_CNFS03"/>
</method>
</fence>
</clusternode>
</clusternodes>
<cman/>
<fencedevices>
<fencedevice agent="fence_ipmilan" auth="none" ipaddr="ipmi-cnfs01" login="root" name="IPMI_LAN_CNFS01" passwd="Rocknro11"/>
<fencedevice agent="fence_ipmilan" auth="md5" ipaddr="ipmi-cnfs02" login="root" name="IPMI_LAN_CNFS02" passwd="Rocknro11"/>
<fencedevice agent="fence_ipmilan" auth="md5" ipaddr="ipmi-cnfs03" login="root" name="IPMI_LAN_CNFS03" passwd="Rocknro11"/>
</fencedevices>
<rm>
<failoverdomains>
<failoverdomain name="cnfs03-down-failover" ordered="1">
<failoverdomainnode name="ic-cnfs01" priority="3"/>
<failoverdomainnode name="ic-cnfs02" priority="2"/>
<failoverdomainnode name="ic-cnfs03" priority="1"/>
</failoverdomain>
<failoverdomain name="cnfs02-down-failover" ordered="1">
<failoverdomainnode name="ic-cnfs01" priority="2"/>
<failoverdomainnode name="ic-cnfs02" priority="1"/>
<failoverdomainnode name="ic-cnfs03" priority="3"/>
</failoverdomain>
<failoverdomain name="cnfs01-down-failover" ordered="1">
<failoverdomainnode name="ic-cnfs01" priority="1"/>
<failoverdomainnode name="ic-cnfs02" priority="3"/>
<failoverdomainnode name="ic-cnfs03" priority="2"/>
</failoverdomain>
<failoverdomain name="cnfs01-up-failover" ordered="1" restricted="0">
<failoverdomainnode name="ic-cnfs01" priority="1"/>
<failoverdomainnode name="ic-cnfs02" priority="2"/>
<failoverdomainnode name="ic-cnfs03" priority="3"/>
</failoverdomain>
<failoverdomain name="cnfs02-up-failover" ordered="1">
<failoverdomainnode name="ic-cnfs01" priority="3"/>
<failoverdomainnode name="ic-cnfs02" priority="1"/>
<failoverdomainnode name="ic-cnfs03" priority="2"/>
</failoverdomain>
<failoverdomain name="cnfs03-up-failover" ordered="1" restricted="0">
<failoverdomainnode name="ic-cnfs01" priority="2"/>
<failoverdomainnode name="ic-cnfs02" priority="3"/>
<failoverdomainnode name="ic-cnfs03" priority="1"/>
</failoverdomain>
</failoverdomains>
<resources/>
<service autostart="1" domain="cnfs01-up-failover" name="cnfs01-vip1" recovery="restart">
<ip address="XXXXXXX" monitor_link="1"/>
</service>
<service autostart="1" domain="cnfs02-up-failover" name="cnfs02-vip1" recovery="restart">
<ip address="XXXXXXX" monitor_link="1"/>
</service>
<service autostart="1" domain="cnfs03-up-failover" name="cnfs03-vip1" recovery="restart">
<ip address="XXXXXXX" monitor_link="1"/>
</service>
<service autostart="1" domain="cnfs01-down-failover" name="cnfs01-vip2" recovery="restart">
<ip address="XXXXXXX" monitor_link="1"/>
</service>
<service autostart="1" domain="cnfs02-down-failover" name="cnfs02-vip2" recovery="restart">
<ip address="XXXXXXX" monitor_link="1"/>
</service>
<service autostart="1" domain="cnfs03-down-failover" name="cnfs03-vip2" recovery="restart">
<ip address="XXXXXXX" monitor_link="1"/>
</service>
</rm>
</cluster>
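Since ccsd distributes cluster.conf to every node, it is worth sanity-checking the file for unbalanced tags (such as a duplicated </service>) before bumping config_version and pushing it out. A minimal sketch, shown here on an inline sample rather than the real file:

```shell
#!/bin/sh
# Quick tag-balance check before propagating cluster.conf. The inline
# sample below is deliberately broken (one extra </service>); on a
# real node, read /etc/cluster/cluster.conf instead.
conf=$(cat <<'EOF'
<rm>
<service name="a">
<ip address="10.0.0.1"/>
</service>
</service>
</rm>
EOF
)

opens=$(printf '%s\n' "$conf" | grep -c '<service ')
closes=$(printf '%s\n' "$conf" | grep -c '</service>')
if [ "$opens" -eq "$closes" ]; then
    echo "service tags balanced"
else
    echo "service tag mismatch: $opens open, $closes close" >&2
fi
```

Where libxml2 is installed, `xmllint --noout /etc/cluster/cluster.conf` does a full well-formedness check.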
How can I safely reboot one node, without affecting CMAN and GFS on the remaining nodes in the cluster?
Thanks in advance.
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
Jon Swift Pratt & Whitney Rocketdyne
Unix Team Technical Lead
email : jon.swift@xxxxxxxxxxx
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
--
Linux-cluster mailing list
Linux-cluster@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/linux-cluster