All,
I have a 2-node test cluster of Dell 1850s running RHEL5U4 (64-bit), serving NFS from 3 GFS2 file systems, with virtual IPs as the only cluster services. Both nodes export/share all 3 file systems all the time. The storage is an EMC CX3-40, with PowerPath underneath the logical volumes the GFS2 file systems are built on.
When I generate an NFS load (using iozone from separate NFS clients) heavy enough to reduce CPU %idle below 75% (as shown by top or vmstat), the cluster crashes. Most of this CPU load is I/O wait time. The higher the load, the more often this happens: under a very heavy load it fails within 5 minutes, while under a light load (CPU %idle above 75%) I see no problems. When it fails, one node logs messages like the following; the other one crashes.
The private network connecting the 2 nodes is currently a cat5 crossover cable. I tried a 10/100/1000 hub as well, but with it in place I was logging collisions. The private network uses IPs 192.168.15.1 (hostname ic-cnfs01) and 192.168.15.2 (hostname ic-cnfs02).
How do I prevent this condition from happening? Thanks in advance.
Nov 13 11:39:14 cnfs01 openais[5817]: [TOTEM] The token was lost in the OPERATIONAL state.
Nov 13 11:39:14 cnfs01 openais[5817]: [TOTEM] Receive multicast socket recv buffer size (320000 bytes).
Nov 13 11:39:14 cnfs01 openais[5817]: [TOTEM] Transmit multicast socket send buffer size (262142 bytes).
Nov 13 11:39:14 cnfs01 openais[5817]: [TOTEM] entering GATHER state from 2.
Nov 13 11:39:19 cnfs01 openais[5817]: [TOTEM] entering GATHER state from 0.
Nov 13 11:39:19 cnfs01 openais[5817]: [TOTEM] Creating commit token because I am the rep.
Nov 13 11:39:19 cnfs01 openais[5817]: [TOTEM] Saving state aru c8 high seq received c8
Nov 13 11:39:19 cnfs01 openais[5817]: [TOTEM] Storing new sequence id for ring 13c
Nov 13 11:39:19 cnfs01 openais[5817]: [TOTEM] entering COMMIT state.
Nov 13 11:39:19 cnfs01 openais[5817]: [TOTEM] entering RECOVERY state.
Nov 13 11:39:19 cnfs01 openais[5817]: [TOTEM] position [0] member 192.168.15.1:
Nov 13 11:39:19 cnfs01 openais[5817]: [TOTEM] previous ring seq 312 rep 192.168.15.1
Nov 13 11:39:19 cnfs01 openais[5817]: [TOTEM] aru c8 high delivered c8 received flag 1
Nov 13 11:39:19 cnfs01 openais[5817]: [TOTEM] Did not need to originate any messages in recovery.
Nov 13 11:39:19 cnfs01 openais[5817]: [TOTEM] Sending initial ORF token
Nov 13 11:39:19 cnfs01 openais[5817]: [CLM ] CLM CONFIGURATION CHANGE
Nov 13 11:39:19 cnfs01 openais[5817]: [CLM ] New Configuration:
Nov 13 11:39:19 cnfs01 kernel: dlm: closing connection to node 2
Nov 13 11:39:19 cnfs01 openais[5817]: [CLM ] r(0) ip(192.168.15.1)
Nov 13 11:39:19 cnfs01 openais[5817]: [CLM ] Members Left:
Nov 13 11:39:19 cnfs01 openais[5817]: [CLM ] r(0) ip(192.168.15.2)
Nov 13 11:39:19 cnfs01 openais[5817]: [CLM ] Members Joined:
Nov 13 11:39:19 cnfs01 openais[5817]: [CLM ] CLM CONFIGURATION CHANGE
Nov 13 11:39:19 cnfs01 openais[5817]: [CLM ] New Configuration:
Nov 13 11:39:19 cnfs01 openais[5817]: [CLM ] r(0) ip(192.168.15.1)
Nov 13 11:39:19 cnfs01 openais[5817]: [CLM ] Members Left:
Nov 13 11:39:19 cnfs01 openais[5817]: [CLM ] Members Joined:
Nov 13 11:39:19 cnfs01 openais[5817]: [SYNC ] This node is within the primary component and will provide service.
Nov 13 11:39:19 cnfs01 openais[5817]: [TOTEM] entering OPERATIONAL state.
Nov 13 11:39:19 cnfs01 openais[5817]: [CLM ] got nodejoin message 192.168.15.1
Nov 13 11:39:20 cnfs01 openais[5817]: [CPG ] got joinlist message from node 1
Nov 13 11:39:21 cnfs01 fenced[5836]: ic-cnfs02 not a cluster member after 2 sec post_fail_delay
Nov 13 11:39:21 cnfs01 fenced[5836]: fencing node "ic-cnfs02"
Cluster RPM versions
rgmanager-2.0.52-1.el5_4.2
lvm2-cluster-2.02.46-8.el5_4.1
cman-2.0.115-1.el5_4.3
openais-0.80.6-8.el5_4.1
kmod-gfs2-1.92-1.1.el5_2.2
gfs2-utils-0.1.62-1.el5
perl-Config-General-2.40-1.el5
system-config-cluster-1.0.57-1.5
ricci-0.12.2-6.el5
piranha-0.8.4-13.el5
luci-0.12.2-6.el5
cluster-snmp-0.12.1-2.el5
cluster-cim-0.12.1-2.el5
Cluster_Administration-en-US-5.2-1
The cluster.conf file is below
<?xml version="1.0"?>
<cluster alias="cnfs_cluster" config_version="78" name="cnfs">
<fence_daemon clean_start="0" post_fail_delay="2" post_join_delay="20"/>
<clusternodes>
<clusternode name="ic-cnfs01" nodeid="1" votes="1">
<fence>
<method name="1">
<device name="IPMI_LAN_CNFS01"/>
</method>
</fence>
</clusternode>
<clusternode name="ic-cnfs02" nodeid="2" votes="1">
<fence>
<method name="1">
<device name="IPMI_LAN_CNFS02"/>
</method>
</fence>
</clusternode>
</clusternodes>
<cman expected_votes="1" two_node="1"/>
<fencedevices>
<fencedevice agent="fence_ipmilan" auth="" ipaddr="ipmi-cnfs01" login="root" name="IPMI_LAN_CNFS01" passwd="Rocknro11"/>
<fencedevice agent="fence_ipmilan" auth="" ipaddr="ipmi-cnfs02" login="root" name="IPMI_LAN_CNFS02" passwd="Rocknro11"/>
</fencedevices>
<rm>
<failoverdomains>
<failoverdomain name="failover-cnfs01-vip1" ordered="1" restricted="0">
<failoverdomainnode name="ic-cnfs01" priority="1"/>
<failoverdomainnode name="ic-cnfs02" priority="2"/>
</failoverdomain>
<failoverdomain name="failover-cnfs01-vip2" ordered="1" restricted="0">
<failoverdomainnode name="ic-cnfs01" priority="1"/>
<failoverdomainnode name="ic-cnfs02" priority="2"/>
</failoverdomain>
<failoverdomain name="failover-cnfs02-vip1" ordered="1" restricted="0">
<failoverdomainnode name="ic-cnfs02" priority="1"/>
<failoverdomainnode name="ic-cnfs01" priority="2"/>
</failoverdomain>
<failoverdomain name="failover-cnfs02-vip2" ordered="1" restricted="0">
<failoverdomainnode name="ic-cnfs01" priority="2"/>
<failoverdomainnode name="ic-cnfs02" priority="1"/>
</failoverdomain>
<failoverdomain name="failover-cnfs03-vip1" ordered="1" restricted="0">
<failoverdomainnode name="ic-cnfs01" priority="1"/>
<failoverdomainnode name="ic-cnfs02" priority="2"/>
</failoverdomain>
<failoverdomain name="failover-cnfs03-vip2" ordered="1" restricted="0">
<failoverdomainnode name="ic-cnfs01" priority="2"/>
<failoverdomainnode name="ic-cnfs02" priority="1"/>
</failoverdomain>
</failoverdomains>
<resources/>
<service autostart="1" domain="failover-cnfs02-vip1" name="cnfs02-vip1" recovery="restart">
<ip address="172.19.130.154" monitor_link="1"/>
</service>
<service autostart="1" domain="failover-cnfs01-vip2" name="cnfs01-vip2" recovery="restart">
<ip address="172.19.130.156" monitor_link="1"/>
</service>
<service autostart="1" domain="failover-cnfs02-vip2" name="cnfs02-vip2" recovery="restart">
<ip address="172.19.130.157" monitor_link="1"/>
</service>
<service autostart="1" domain="failover-cnfs01-vip1" name="cnfs01-vip1" recovery="restart">
<ip address="172.19.130.153" monitor_link="1"/>
</service>
<service autostart="1" domain="failover-cnfs03-vip1" name="cnfs03-vip1" recovery="restart">
<ip address="172.19.130.155" monitor_link="1"/>
</service>
<service autostart="1" domain="failover-cnfs03-vip2" name="cnfs03-vip2" recovery="restart">
<ip address="172.19.130.158" monitor_link="1"/>
</service>
</rm>
</cluster>
The openais.conf file is below
# Please read the openais.conf.5 manual page
totem {
version: 2
secauth: off
threads: 0
interface {
ringnumber: 0
bindnetaddr: 192.168.15.0
mcastaddr: 226.94.1.1
mcastport: 5405
}
}
logging {
debug: off
timestamp: on
to_syslog: yes
}
amf {
mode: disabled
}
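For reference on the failure mode in the logs above: openais declares the token lost when it does not return within the totem token timeout, and under heavy I/O wait the default timeout can be too short, triggering a false membership change and a fence. Note that on a RHEL5 cman cluster, cman generates the openais configuration at startup, so totem parameters are normally set in cluster.conf rather than in openais.conf. A possible mitigation (untested on this cluster; the value is illustrative only) is to raise the token timeout by adding, inside the <cluster> element of /etc/cluster/cluster.conf:

        <!-- illustrative value (30 s); increment config_version when changing -->
        <totem token="30000"/>

A longer token timeout trades slower failure detection for fewer spurious fences under load.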
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
Jon Swift Pratt & Whitney Rocketdyne
Unix Team Technical Lead
email : jon.swift@xxxxxxxxxxx
phone : (818) 586-4029
pager : (818) 328-4112
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
--
Linux-cluster mailing list
Linux-cluster@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/linux-cluster