Title: openais[5817]: [TOTEM] The token was lost in the OPERATIONAL state.

All,
        I have a two-node test cluster built on Dell 1850s running RHEL 5.4 (64-bit), with virtual IPs as the only services, providing NFS on three GFS2 file systems. Both nodes of the cluster export all three file systems at all times. When I create an NFS load that reduces CPU %idle to less than 75% (as shown by top or vmstat), the cluster crashes; most of that CPU load is I/O wait time. I'm using iozone from separate NFS clients to generate the load. The heavier the load on the cluster, the more often this happens: under a very heavy load it fails within five minutes, while with a light load (CPU %idle above 75%) I see no problems. One node logs messages like the ones below; the other node crashes.

        The private network connecting the two nodes is currently a Cat5 crossover cable. I tried a 10/100/1000 hub as well, but with it in place I was logging collisions. The private network uses IPs 192.168.15.1 (hostname ic-cnfs01) and 192.168.15.2 (hostname ic-cnfs02). The storage is an EMC CX3-40, with PowerPath multipathing the logical volumes the GFS2 file systems are built on.

        How do I prevent this condition from happening? Thanks in advance.
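
        For reference, one mitigation I'm considering (untested here, so treat it as a sketch rather than a confirmed fix) is raising the totem token timeout, so that long I/O-wait stalls are less likely to be declared a lost token. My understanding is that on a cman-managed cluster this has to go in cluster.conf rather than openais.conf, directly under the <cluster> element:

<?xml version="1.0"?>
<cluster alias="cnfs_cluster" config_version="79" name="cnfs">
        <!-- Assumption: raise the token timeout from the cman default
             (10 seconds, I believe) to 30 seconds, so that transient
             I/O-wait stalls do not trigger a membership change -->
        <totem token="30000"/>
        ...
</cluster>

        The config_version bump to 79 is just illustrative; the change would need to be propagated to both nodes (e.g. with ccs_tool update) before it takes effect.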


Nov 13 11:39:14 cnfs01 openais[5817]: [TOTEM] The token was lost in the OPERATIONAL state.
Nov 13 11:39:14 cnfs01 openais[5817]: [TOTEM] Receive multicast socket recv buffer size (320000 bytes).
Nov 13 11:39:14 cnfs01 openais[5817]: [TOTEM] Transmit multicast socket send buffer size (262142 bytes).
Nov 13 11:39:14 cnfs01 openais[5817]: [TOTEM] entering GATHER state from 2.
Nov 13 11:39:19 cnfs01 openais[5817]: [TOTEM] entering GATHER state from 0.
Nov 13 11:39:19 cnfs01 openais[5817]: [TOTEM] Creating commit token because I am the rep.
Nov 13 11:39:19 cnfs01 openais[5817]: [TOTEM] Saving state aru c8 high seq received c8
Nov 13 11:39:19 cnfs01 openais[5817]: [TOTEM] Storing new sequence id for ring 13c
Nov 13 11:39:19 cnfs01 openais[5817]: [TOTEM] entering COMMIT state.
Nov 13 11:39:19 cnfs01 openais[5817]: [TOTEM] entering RECOVERY state.
Nov 13 11:39:19 cnfs01 openais[5817]: [TOTEM] position [0] member 192.168.15.1:
Nov 13 11:39:19 cnfs01 openais[5817]: [TOTEM] previous ring seq 312 rep 192.168.15.1
Nov 13 11:39:19 cnfs01 openais[5817]: [TOTEM] aru c8 high delivered c8 received flag 1
Nov 13 11:39:19 cnfs01 openais[5817]: [TOTEM] Did not need to originate any messages in recovery.
Nov 13 11:39:19 cnfs01 openais[5817]: [TOTEM] Sending initial ORF token
Nov 13 11:39:19 cnfs01 openais[5817]: [CLM  ] CLM CONFIGURATION CHANGE
Nov 13 11:39:19 cnfs01 openais[5817]: [CLM  ] New Configuration:
Nov 13 11:39:19 cnfs01 kernel: dlm: closing connection to node 2
Nov 13 11:39:19 cnfs01 openais[5817]: [CLM  ]   r(0) ip(192.168.15.1)
Nov 13 11:39:19 cnfs01 openais[5817]: [CLM  ] Members Left:
Nov 13 11:39:19 cnfs01 openais[5817]: [CLM  ]   r(0) ip(192.168.15.2)
Nov 13 11:39:19 cnfs01 openais[5817]: [CLM  ] Members Joined:
Nov 13 11:39:19 cnfs01 openais[5817]: [CLM  ] CLM CONFIGURATION CHANGE
Nov 13 11:39:19 cnfs01 openais[5817]: [CLM  ] New Configuration:
Nov 13 11:39:19 cnfs01 openais[5817]: [CLM  ]   r(0) ip(192.168.15.1)
Nov 13 11:39:19 cnfs01 openais[5817]: [CLM  ] Members Left:
Nov 13 11:39:19 cnfs01 openais[5817]: [CLM  ] Members Joined:
Nov 13 11:39:19 cnfs01 openais[5817]: [SYNC ] This node is within the primary component and will provide service.
Nov 13 11:39:19 cnfs01 openais[5817]: [TOTEM] entering OPERATIONAL state.
Nov 13 11:39:19 cnfs01 openais[5817]: [CLM  ] got nodejoin message 192.168.15.1
Nov 13 11:39:20 cnfs01 openais[5817]: [CPG  ] got joinlist message from node 1
Nov 13 11:39:21 cnfs01 fenced[5836]: ic-cnfs02 not a cluster member after 2 sec post_fail_delay
Nov 13 11:39:21 cnfs01 fenced[5836]: fencing node "ic-cnfs02" 

Cluster RPM versions

rgmanager-2.0.52-1.el5_4.2         
lvm2-cluster-2.02.46-8.el5_4.1               
cman-2.0.115-1.el5_4.3                       
openais-0.80.6-8.el5_4.1                     
kmod-gfs2-1.92-1.1.el5_2.2                   
gfs2-utils-0.1.62-1.el5                      
perl-Config-General-2.40-1.el5               
system-config-cluster-1.0.57-1.5             
ricci-0.12.2-6.el5                           
piranha-0.8.4-13.el5                         
luci-0.12.2-6.el5                         
cluster-snmp-0.12.1-2.el5                 
cluster-cim-0.12.1-2.el5                     
Cluster_Administration-en-US-5.2-1           



The cluster.conf file is below


<?xml version="1.0"?>
<cluster alias="cnfs_cluster" config_version="78" name="cnfs">
        <fence_daemon clean_start="0" post_fail_delay="2" post_join_delay="20"/>
        <clusternodes>
                <clusternode name="ic-cnfs01" nodeid="1" votes="1">
                        <fence>
                                <method name="1">
                                        <device name="IPMI_LAN_CNFS01"/>
                                </method>
                        </fence>
                </clusternode>
                <clusternode name="ic-cnfs02" nodeid="2" votes="1">
                        <fence>
                                <method name="1">
                                        <device name="IPMI_LAN_CNFS02"/>
                                </method>
                        </fence>
                </clusternode>
        </clusternodes>
        <cman expected_votes="1" two_node="1"/>
        <fencedevices>
                <fencedevice agent="fence_ipmilan" auth="" ipaddr="ipmi-cnfs01" login="root" name="IPMI_LAN_CNFS01" passwd="Rocknro11"/>
                <fencedevice agent="fence_ipmilan" auth="" ipaddr="ipmi-cnfs02" login="root" name="IPMI_LAN_CNFS02" passwd="Rocknro11"/>
        </fencedevices>
        <rm>
                <failoverdomains>
                        <failoverdomain name="failover-cnfs01-vip1" ordered="1" restricted="0">
                                <failoverdomainnode name="ic-cnfs01" priority="1"/>
                                <failoverdomainnode name="ic-cnfs02" priority="2"/>
                        </failoverdomain>
                        <failoverdomain name="failover-cnfs01-vip2" ordered="1" restricted="0">
                                <failoverdomainnode name="ic-cnfs01" priority="1"/>
                                <failoverdomainnode name="ic-cnfs02" priority="2"/>
                        </failoverdomain>
                        <failoverdomain name="failover-cnfs02-vip1" ordered="1" restricted="0">
                                <failoverdomainnode name="ic-cnfs02" priority="1"/>
                                <failoverdomainnode name="ic-cnfs01" priority="2"/>
                        </failoverdomain>
                        <failoverdomain name="failover-cnfs02-vip2" ordered="1" restricted="0">
                                <failoverdomainnode name="ic-cnfs01" priority="2"/>
                                <failoverdomainnode name="ic-cnfs02" priority="1"/>
                        </failoverdomain>
                        <failoverdomain name="failover-cnfs03-vip1" ordered="1" restricted="0">
                                <failoverdomainnode name="ic-cnfs01" priority="1"/>
                                <failoverdomainnode name="ic-cnfs02" priority="2"/>
                        </failoverdomain>
                        <failoverdomain name="failover-cnfs03-vip2" ordered="1" restricted="0">
                                <failoverdomainnode name="ic-cnfs01" priority="2"/>
                                <failoverdomainnode name="ic-cnfs02" priority="1"/>
                        </failoverdomain>
                </failoverdomains>
                <resources/>
                <service autostart="1" domain="failover-cnfs02-vip1" name="cnfs02-vip1" recovery="restart">
                        <ip address="172.19.130.154" monitor_link="1"/>
                </service>
                <service autostart="1" domain="failover-cnfs01-vip2" name="cnfs01-vip2" recovery="restart">
                        <ip address="172.19.130.156" monitor_link="1"/>
                </service>
                <service autostart="1" domain="failover-cnfs02-vip2" name="cnfs02-vip2" recovery="restart">
                        <ip address="172.19.130.157" monitor_link="1"/>
                </service>
                <service autostart="1" domain="failover-cnfs01-vip1" name="cnfs01-vip1" recovery="restart">
                        <ip address="172.19.130.153" monitor_link="1"/>
                </service>
                <service autostart="1" domain="failover-cnfs01-vip1" name="cnfs03-vip1" recovery="restart">
                        <ip address="172.19.130.155" monitor_link="1"/>
                </service>
                <service autostart="1" domain="failover-cnfs03-vip2" name="cnfs03-vip2" recovery="restart">
                        <ip address="172.19.130.158" monitor_link="1"/>
                </service>
        </rm>
</cluster>

The openais.conf file is below

# Please read the openais.conf.5 manual page

totem {
        version: 2
        secauth: off
        threads: 0
        interface {
                ringnumber: 0
                bindnetaddr: 192.168.15.0
                mcastaddr: 226.94.1.1
                mcastport: 5405
        }
}

logging {
        debug: off
        timestamp: on
        to_syslog: yes
}

amf {
        mode: disabled
}
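
For completeness, the equivalent tuning in openais.conf would look like the totem block below, though my understanding is that when openais is started by cman (as it is here) this file is not read at all, and the totem settings have to come from cluster.conf as sketched above:

totem {
        version: 2
        secauth: off
        threads: 0
        # Assumption: 30-second token timeout; as far as I can tell this
        # is only honored when openais runs standalone, not under cman
        token: 30000
        interface {
                ringnumber: 0
                bindnetaddr: 192.168.15.0
                mcastaddr: 226.94.1.1
                mcastport: 5405
        }
}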

-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
 Jon Swift                   Pratt & Whitney Rocketdyne
                             Unix Team Technical Lead
                             email  : jon.swift@xxxxxxxxxxx 
                             phone  : (818) 586-4029
                             pager  : (818) 328-4112
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-

--
Linux-cluster mailing list
Linux-cluster@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/linux-cluster
