All,
I have a 2-node test cluster of Dell 1850s running RHEL5U4 (64-bit), serving NFS from 3 GFS2 file systems, with virtual IPs as the only cluster services. Both nodes export/share all 3 file systems all the time. The storage is an EMC CX3-40, with PowerPath underneath the logical volumes the GFS2 file systems are built on.
When I generate an NFS load (using iozone from separate NFS clients) heavy enough to reduce CPU %idle below 75% (as shown by top or vmstat), the cluster crashes. Most of this CPU load is I/O wait time. The higher the load, the more often this happens: under a very heavy load it fails within 5 minutes, while under a light load (CPU %idle above 75%) I see no problems. When it fails, one node logs messages like the following; the other one crashes.
The private network connecting the 2 nodes is currently a cat5 crossover cable. I tried a 10/100/1000 hub as well, but with it in place I was logging collisions. The private network uses IPs 192.168.15.1 (hostname ic-cnfs01) and 192.168.15.2 (hostname ic-cnfs02).
How do I prevent this condition from happening? Thanks in advance.
Nov 13 11:39:14 cnfs01 openais[5817]: [TOTEM] The token was lost in the OPERATIONAL state.
Nov 13 11:39:14 cnfs01 openais[5817]: [TOTEM] Receive multicast socket recv buffer size (320000 bytes).
Nov 13 11:39:14 cnfs01 openais[5817]: [TOTEM] Transmit multicast socket send buffer size (262142 bytes).
Nov 13 11:39:14 cnfs01 openais[5817]: [TOTEM] entering GATHER state from 2.
Nov 13 11:39:19 cnfs01 openais[5817]: [TOTEM] entering GATHER state from 0.
Nov 13 11:39:19 cnfs01 openais[5817]: [TOTEM] Creating commit token because I am the rep.
Nov 13 11:39:19 cnfs01 openais[5817]: [TOTEM] Saving state aru c8 high seq received c8
Nov 13 11:39:19 cnfs01 openais[5817]: [TOTEM] Storing new sequence id for ring 13c
Nov 13 11:39:19 cnfs01 openais[5817]: [TOTEM] entering COMMIT state.
Nov 13 11:39:19 cnfs01 openais[5817]: [TOTEM] entering RECOVERY state.
Nov 13 11:39:19 cnfs01 openais[5817]: [TOTEM] position [0] member 192.168.15.1:
Nov 13 11:39:19 cnfs01 openais[5817]: [TOTEM] previous ring seq 312 rep 192.168.15.1
Nov 13 11:39:19 cnfs01 openais[5817]: [TOTEM] aru c8 high delivered c8 received flag 1
Nov 13 11:39:19 cnfs01 openais[5817]: [TOTEM] Did not need to originate any messages in recovery.
Nov 13 11:39:19 cnfs01 openais[5817]: [TOTEM] Sending initial ORF token
Nov 13 11:39:19 cnfs01 openais[5817]: [CLM ] CLM CONFIGURATION CHANGE
Nov 13 11:39:19 cnfs01 openais[5817]: [CLM ] New Configuration:
Nov 13 11:39:19 cnfs01 kernel: dlm: closing connection to node 2
Nov 13 11:39:19 cnfs01 openais[5817]: [CLM ] r(0) ip(192.168.15.1)
Nov 13 11:39:19 cnfs01 openais[5817]: [CLM ] Members Left:
Nov 13 11:39:19 cnfs01 openais[5817]: [CLM ] r(0) ip(192.168.15.2)
Nov 13 11:39:19 cnfs01 openais[5817]: [CLM ] Members Joined:
Nov 13 11:39:19 cnfs01 openais[5817]: [CLM ] CLM CONFIGURATION CHANGE
Nov 13 11:39:19 cnfs01 openais[5817]: [CLM ] New Configuration:
Nov 13 11:39:19 cnfs01 openais[5817]: [CLM ] r(0) ip(192.168.15.1)
Nov 13 11:39:19 cnfs01 openais[5817]: [CLM ] Members Left:
Nov 13 11:39:19 cnfs01 openais[5817]: [CLM ] Members Joined:
Nov 13 11:39:19 cnfs01 openais[5817]: [SYNC ] This node is within the primary component and will provide service.
Nov 13 11:39:19 cnfs01 openais[5817]: [TOTEM] entering OPERATIONAL state.
Nov 13 11:39:19 cnfs01 openais[5817]: [CLM ] got nodejoin message 192.168.15.1
Nov 13 11:39:20 cnfs01 openais[5817]: [CPG ] got joinlist message from node 1
Nov 13 11:39:21 cnfs01 fenced[5836]: ic-cnfs02 not a cluster member after 2 sec post_fail_delay
Nov 13 11:39:21 cnfs01 fenced[5836]: fencing node "ic-cnfs02"
Cluster RPM versions
rgmanager-2.0.52-1.el5_4.2
lvm2-cluster-2.02.46-8.el5_4.1
cman-2.0.115-1.el5_4.3
openais-0.80.6-8.el5_4.1
kmod-gfs2-1.92-1.1.el5_2.2
gfs2-utils-0.1.62-1.el5
perl-Config-General-2.40-1.el5
system-config-cluster-1.0.57-1.5
ricci-0.12.2-6.el5
piranha-0.8.4-13.el5
luci-0.12.2-6.el5
cluster-snmp-0.12.1-2.el5
cluster-cim-0.12.1-2.el5
Cluster_Administration-en-US-5.2-1
The cluster.conf file is below
<?xml version="1.0"?>
<cluster alias="cnfs_cluster" config_version="78" name="cnfs">
<fence_daemon clean_start="0" post_fail_delay="2" post_join_delay="20"/>
<clusternodes>
<clusternode name="ic-cnfs01" nodeid="1" votes="1">
<fence>
<method name="1">
<device name="IPMI_LAN_CNFS01"/>
</method>
</fence>
</clusternode>
<clusternode name="ic-cnfs02" nodeid="2" votes="1">
<fence>
<method name="1">
<device name="IPMI_LAN_CNFS02"/>
</method>
</fence>
</clusternode>
</clusternodes>
<cman expected_votes="1" two_node="1"/>
<fencedevices>
<fencedevice agent="fence_ipmilan" auth="" ipaddr="ipmi-cnfs01" login="root" name="IPMI_LAN_CNFS01" passwd="Rocknro11"/>
<fencedevice agent="fence_ipmilan" auth="" ipaddr="ipmi-cnfs02" login="root" name="IPMI_LAN_CNFS02" passwd="Rocknro11"/>
</fencedevices>
<rm>
<failoverdomains>
<failoverdomain name="failover-cnfs01-vip1" ordered="1" restricted="0">
<failoverdomainnode name="ic-cnfs01" priority="1"/>
<failoverdomainnode name="ic-cnfs02" priority="2"/>
</failoverdomain>
<failoverdomain name="failover-cnfs01-vip2" ordered="1" restricted="0">
<failoverdomainnode name="ic-cnfs01" priority="1"/>
<failoverdomainnode name="ic-cnfs02" priority="2"/>
</failoverdomain>
<failoverdomain name="failover-cnfs02-vip1" ordered="1" restricted="0">
<failoverdomainnode name="ic-cnfs02" priority="1"/>
<failoverdomainnode name="ic-cnfs01" priority="2"/>
</failoverdomain>
<failoverdomain name="failover-cnfs02-vip2" ordered="1" restricted="0">
<failoverdomainnode name="ic-cnfs01" priority="2"/>
<failoverdomainnode name="ic-cnfs02" priority="1"/>
</failoverdomain>
<failoverdomain name="failover-cnfs03-vip1" ordered="1" restricted="0">
<failoverdomainnode name="ic-cnfs01" priority="1"/>
<failoverdomainnode name="ic-cnfs02" priority="2"/>
</failoverdomain>
<failoverdomain name="failover-cnfs03-vip2" ordered="1" restricted="0">
<failoverdomainnode name="ic-cnfs01" priority="2"/>
<failoverdomainnode name="ic-cnfs02" priority="1"/>
</failoverdomain>
</failoverdomains>
<resources/>
<service autostart="1" domain="failover-cnfs02-vip1" name="cnfs02-vip1" recovery="restart">
<ip address="172.19.130.154" monitor_link="1"/>
</service>
<service autostart="1" domain="failover-cnfs01-vip2" name="cnfs01-vip2" recovery="restart">
<ip address="172.19.130.156" monitor_link="1"/>
</service>
<service autostart="1" domain="failover-cnfs02-vip2" name="cnfs02-vip2" recovery="restart">
<ip address="172.19.130.157" monitor_link="1"/>
</service>
<service autostart="1" domain="failover-cnfs01-vip1" name="cnfs01-vip1" recovery="restart">
<ip address="172.19.130.153" monitor_link="1"/>
</service>
<service autostart="1" domain="failover-cnfs03-vip1" name="cnfs03-vip1" recovery="restart">
<ip address="172.19.130.155" monitor_link="1"/>
</service>
<service autostart="1" domain="failover-cnfs03-vip2" name="cnfs03-vip2" recovery="restart">
<ip address="172.19.130.158" monitor_link="1"/>
</service>
</rm>
</cluster>
The openais.conf file is below
# Please read the openais.conf.5 manual page
totem {
version: 2
secauth: off
threads: 0
interface {
ringnumber: 0
bindnetaddr: 192.168.15.0
mcastaddr: 226.94.1.1
mcastport: 5405
}
}
logging {
debug: off
timestamp: on
to_syslog: yes
}
amf {
mode: disabled
}
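For reference on the failure mode in the logs above: openais declares the token lost when it does not return within the totem token timeout, and under heavy I/O wait the default timeout can be too short, triggering a false membership change and a fence. Note that on a RHEL5 cman cluster, cman generates the openais configuration at startup, so totem parameters are normally set in cluster.conf rather than in openais.conf. A possible mitigation (untested on this cluster; the value is illustrative only) is to raise the token timeout by adding, inside the <cluster> element of /etc/cluster/cluster.conf:

        <!-- illustrative value (30 s); increment config_version when changing -->
        <totem token="30000"/>

A longer token timeout trades slower failure detection for fewer spurious fences under load.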
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
Jon Swift Pratt & Whitney Rocketdyne
Unix Team Technical Lead
email : jon.swift@xxxxxxxxxxx
phone : (818) 586-4029
pager : (818) 328-4112
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
--
Linux-cluster mailing list
Linux-cluster@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/linux-cluster