reasons for sporadic token loss?

Hi all!

I am experiencing sporadic problems with my cluster setup. Maybe someone has an idea? But first some facts:

Type: RHEL 6.1 two-node cluster (corosync 1.2.3-36) on two Dell R610s, each with a quad-port NIC

NICs:
- interfaces em1/em2 are bonded using mode 5 (bond1); these interfaces are cross-connected (intended for the cluster housekeeping communication), with no network element in between
- interfaces em3/em4 are bonded using mode 1 (bond0); these interfaces are connected to two switches
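
For reference, the bonds are defined roughly like this on RHEL 6 (a minimal sketch; the netmask, miimon value and exact file contents are illustrative, not copied from the machines):

# /etc/sysconfig/network-scripts/ifcfg-bond1  (cluster interconnect, em1/em2)
DEVICE=bond1
IPADDR=172.16.42.2
NETMASK=255.255.255.0
ONBOOT=yes
BOOTPROTO=none
BONDING_OPTS="mode=5 miimon=100"

# /etc/sysconfig/network-scripts/ifcfg-bond0  (service network, em3/em4)
DEVICE=bond0
ONBOOT=yes
BOOTPROTO=none
BONDING_OPTS="mode=1 miimon=100"

# each slave looks like this, e.g. /etc/sysconfig/network-scripts/ifcfg-em1
DEVICE=em1
MASTER=bond1
SLAVE=yes
ONBOOT=yes
BOOTPROTO=none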

Cluster configuration:

<?xml version="1.0"?>
<cluster config_version="51" name="my-cluster">
    <cman expected_votes="1" two_node="1"/>
    <clusternodes>
        <clusternode name="df1-clusterlink" nodeid="1">
            <fence>
                <method name="VBoxManage-DF-1">
                    <device name="VBoxManage-DF-1" />
                </method>
            </fence>
            <unfence>
            </unfence>
        </clusternode>
        <clusternode name="df2-clusterlink" nodeid="2">
            <fence>
                <method name="VBoxManage-DF-2">
                    <device name="VBoxManage-DF-2" />
                </method>

            </fence>
            <unfence>
            </unfence>
        </clusternode>
    </clusternodes>
    <fencedevices>
        <fencedevice name="VBoxManage-DF-1" agent="fence_vbox" vboxhost="vboxhost.private" login="test" vmname="RHEL 6.1 x86_64 DF-System Server 1" />
        <fencedevice name="VBoxManage-DF-2" agent="fence_vbox" vboxhost="vboxhost.private" login="test" vmname="RHEL 6.1 x86_64 DF-System Server 2" />
    </fencedevices>
    <rm>
        <resources>
            <ip address="10.200.104.15/27" monitor_link="on" sleeptime="10"/>
            <script file="/usr/share/cluster/app.sh" name="myapp"/>
        </resources>
        <failoverdomains>
            <failoverdomain name="fod-myapp" nofailback="0" ordered="1" restricted="0">
                <failoverdomainnode name="df1-clusterlink" priority="1"/>
                <failoverdomainnode name="df2-clusterlink" priority="2"/>
            </failoverdomain>
        </failoverdomains>
        <service domain="fod-myapp" exclusive="1" max_restarts="3" name="rg-myapp" recovery="restart" restart_expire_time="1">
            <script ref="myapp"/>
            <ip ref="10.200.104.15/27"/>
        </service>
    </rm>
    <logging debug="on"/>
    <gfs_controld enable_plock="0" plock_rate_limit="0"/>
    <dlm enable_plock="0" plock_ownership="1" plock_rate_limit="0"/>
</cluster>


--------------------------------------------------------------------------------

Problem:
Sometimes the second node "detects" that the token has been lost (corosync.log):

[no TOTEM messages before that]
Jul 28 13:00:10 corosync [TOTEM ] The token was lost in the OPERATIONAL state.
Jul 28 13:00:10 corosync [TOTEM ] A processor failed, forming new configuration.
Jul 28 13:00:10 corosync [TOTEM ] Receive multicast socket recv buffer size (262142 bytes).
Jul 28 13:00:10 corosync [TOTEM ] Transmit multicast socket send buffer size (262142 bytes).

This happens, let's say, once a week, and it leads to fencing of the first node. From the output of 'corosync-objctl -a' it looks like this may be due to a consensus timeout (an excerpt from the command's output follows); I have marked the lines that I consider important so far:

totem.transport=udp
totem.version=2
totem.nodeid=2
totem.vsftype=none
totem.token=10000
totem.join=60
totem.fail_recv_const=2500
totem.consensus=2000
totem.rrp_mode=none
totem.secauth=1
totem.key=my-cluster
totem.interface.ringnumber=0
totem.interface.bindnetaddr=172.16.42.2
totem.interface.mcastaddr=239.192.187.168
totem.interface.mcastport=5405
runtime.totem.pg.mrp.srp.orf_token_tx=3
runtime.totem.pg.mrp.srp.orf_token_rx=1103226
runtime.totem.pg.mrp.srp.memb_merge_detect_tx=395
runtime.totem.pg.mrp.srp.memb_merge_detect_rx=1098359
runtime.totem.pg.mrp.srp.memb_join_tx=38
runtime.totem.pg.mrp.srp.memb_join_rx=50
runtime.totem.pg.mrp.srp.mcast_tx=218
runtime.totem.pg.mrp.srp.mcast_retx=0
runtime.totem.pg.mrp.srp.mcast_rx=541
runtime.totem.pg.mrp.srp.memb_commit_token_tx=12
runtime.totem.pg.mrp.srp.memb_commit_token_rx=18
runtime.totem.pg.mrp.srp.token_hold_cancel_tx=49
runtime.totem.pg.mrp.srp.token_hold_cancel_rx=173
runtime.totem.pg.mrp.srp.operational_entered=6
runtime.totem.pg.mrp.srp.operational_token_lost=1
^^^
runtime.totem.pg.mrp.srp.gather_entered=7
runtime.totem.pg.mrp.srp.gather_token_lost=0
runtime.totem.pg.mrp.srp.commit_entered=6
runtime.totem.pg.mrp.srp.commit_token_lost=0
runtime.totem.pg.mrp.srp.recovery_entered=6
runtime.totem.pg.mrp.srp.recovery_token_lost=0
runtime.totem.pg.mrp.srp.consensus_timeouts=1
^^^
runtime.totem.pg.mrp.srp.mtt_rx_token=1727
runtime.totem.pg.mrp.srp.avg_token_workload=62244458
runtime.totem.pg.mrp.srp.avg_backlog_calc=0
runtime.totem.pg.mrp.srp.rx_msg_dropped=0
runtime.totem.pg.mrp.srp.members.2.ip=r(0) ip(172.16.42.2)
runtime.totem.pg.mrp.srp.members.2.join_count=1
runtime.totem.pg.mrp.srp.members.2.status=joined
runtime.totem.pg.mrp.srp.members.1.ip=r(0) ip(172.16.42.1)
runtime.totem.pg.mrp.srp.members.1.join_count=3
runtime.totem.pg.mrp.srp.members.1.status=joined
runtime.blackbox.dump_flight_data=no
runtime.blackbox.dump_state=no
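
(The excerpt above is hand-picked from the full output; to re-check the interesting counters after the next incident I will probably just filter it, e.g.:)

corosync-objctl -a | grep -E 'token_lost|consensus_timeouts|operational_entered|rx_msg_dropped'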

Some questions at this point:
A) why did the cluster lose the token? due to a timeout? token (10000) or consensus (2000)?
B) why did the timeout elapse? maybe that is connected with the answer to A ... ?
C) is it normal that 'token=10000' and 'consensus=2000', although the documentation says the defaults are 'token=1000' and 'consensus=1.2*token'?
D) since I suspect problems with the switches connecting the other interfaces (em3/em4, bonded to bond0) of those machines, I wonder whether any cluster traffic goes that way and not via bond1? (see the check sketched below)
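
Regarding D, my plan is to simply watch both bonds for totem traffic (interface names as described above, port taken from totem.interface.mcastport):

# on node 2: the cluster interconnect should carry the totem traffic
tcpdump -i bond1 -n udp port 5405

# in parallel: nothing should show up on the service bond
tcpdump -i bond0 -n udp port 5405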

As I already stated: the connection of em1/em2 is a direct one without any network element in between.

For now, I want to add the following line to cluster.conf and see whether the situation improves:

<totem token_retransmits_before_loss_const="10" fail_recv_const="100" consensus="12000"/>
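
For completeness, the <totem> element would sit directly under <cluster>, next to <cman> (bumping config_version to 52 is my assumption), and I would validate and distribute it with the usual cman tools:

<cluster config_version="52" name="my-cluster">
    <cman expected_votes="1" two_node="1"/>
    <totem token_retransmits_before_loss_const="10" fail_recv_const="100" consensus="12000"/>
    <!-- ... rest of the existing configuration unchanged ... -->
</cluster>

# validate the edited cluster.conf, then push the new version to both nodes
ccs_config_validate
cman_tool version -r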

Any comment concerning that?

While googling for reasons I have also seen that it can be a problem if the clocks of the two nodes are not synchronized; but in my case ntpd on both nodes uses two stratum 2 NTP servers. I also cannot find anything unusual in the log files, e.g. a jump of multiple seconds, although I have to admit that ntpd does not yet run with debugging enabled.
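
Just to document how I am checking the clocks (plain ntpq, nothing cluster-specific; the ssh call assumes node-to-node ssh works):

# run on both nodes; offset and jitter should stay in the low-millisecond range
ntpq -p

# quick comparison of the wall clocks as seen from node 1
date; ssh df2-clusterlink date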


Thanks in advance for any hint or comment!


Kind regards,

    Heiko


