reasons for sporadic token loss?

Hi all!

I am experiencing sporadic problems with my cluster setup. Maybe someone has an idea? But first some facts:

Type: RHEL 6.1 two-node cluster (corosync 1.2.3-36) on two Dell R610s, each with a quad-port NIC

NICs:
- interfaces em1/em2 are bonded using mode 5 (bond1); these interfaces are cross-connected (intended for the cluster housekeeping communication), with no network element in between
- interfaces em3/em4 are bonded using mode 1 (bond0); these interfaces are connected to two switches
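
For reference, the bonds are defined roughly like this on RHEL 6 (a minimal sketch; the netmask, miimon value and exact file contents are illustrative, not copied from the machines):

# /etc/sysconfig/network-scripts/ifcfg-bond1  (cluster interconnect, em1/em2)
DEVICE=bond1
IPADDR=172.16.42.2
NETMASK=255.255.255.0
ONBOOT=yes
BOOTPROTO=none
BONDING_OPTS="mode=5 miimon=100"

# /etc/sysconfig/network-scripts/ifcfg-bond0  (service network, em3/em4)
DEVICE=bond0
ONBOOT=yes
BOOTPROTO=none
BONDING_OPTS="mode=1 miimon=100"

# each slave looks like this, e.g. /etc/sysconfig/network-scripts/ifcfg-em1
DEVICE=em1
MASTER=bond1
SLAVE=yes
ONBOOT=yes
BOOTPROTO=none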

Cluster configuration:

<?xml version="1.0"?>
<cluster config_version="51" name="my-cluster">
    <cman expected_votes="1" two_node="1"/>
    <clusternodes>
        <clusternode name="df1-clusterlink" nodeid="1">
            <fence>
                <method name="VBoxManage-DF-1">
                    <device name="VBoxManage-DF-1" />
                </method>
            </fence>
            <unfence>
            </unfence>
        </clusternode>
        <clusternode name="df2-clusterlink" nodeid="2">
            <fence>
                <method name="VBoxManage-DF-2">
                    <device name="VBoxManage-DF-2" />
                </method>

            </fence>
            <unfence>
            </unfence>
        </clusternode>
    </clusternodes>
    <fencedevices>
        <fencedevice name="VBoxManage-DF-1" agent="fence_vbox" vboxhost="vboxhost.private" login="test" vmname="RHEL 6.1 x86_64 DF-System Server 1" />
        <fencedevice name="VBoxManage-DF-2" agent="fence_vbox" vboxhost="vboxhost.private" login="test" vmname="RHEL 6.1 x86_64 DF-System Server 2" />
    </fencedevices>
    <rm>
        <resources>
            <ip address="10.200.104.15/27" monitor_link="on" sleeptime="10"/>
            <script file="/usr/share/cluster/app.sh" name="myapp"/>
        </resources>
        <failoverdomains>
            <failoverdomain name="fod-myapp" nofailback="0" ordered="1" restricted="0">
                <failoverdomainnode name="df1-clusterlink" priority="1"/>
                <failoverdomainnode name="df2-clusterlink" priority="2"/>
            </failoverdomain>
        </failoverdomains>
        <service domain="fod-myapp" exclusive="1" max_restarts="3" name="rg-myapp" recovery="restart" restart_expire_time="1">
            <script ref="myapp"/>
            <ip ref="10.200.104.15/27"/>
        </service>
    </rm>
    <logging debug="on"/>
    <gfs_controld enable_plock="0" plock_rate_limit="0"/>
    <dlm enable_plock="0" plock_ownership="1" plock_rate_limit="0"/>
</cluster>


--------------------------------------------------------------------------------

Problem:
Sometimes the second node "detects" that the token has been lost (corosync.log):

[no TOTEM messages before that]
Jul 28 13:00:10 corosync [TOTEM ] The token was lost in the OPERATIONAL state.
Jul 28 13:00:10 corosync [TOTEM ] A processor failed, forming new configuration.
Jul 28 13:00:10 corosync [TOTEM ] Receive multicast socket recv buffer size (262142 bytes).
Jul 28 13:00:10 corosync [TOTEM ] Transmit multicast socket send buffer size (262142 bytes).

This happens, let's say, once a week, and it leads to fencing of the first node. From the output of 'corosync-objctl -a' it looks like this may be due to a consensus timeout (an excerpt from the command's output follows); I have marked the lines that I consider important so far:

totem.transport=udp
totem.version=2
totem.nodeid=2
totem.vsftype=none
totem.token=10000
totem.join=60
totem.fail_recv_const=2500
totem.consensus=2000
totem.rrp_mode=none
totem.secauth=1
totem.key=my-cluster
totem.interface.ringnumber=0
totem.interface.bindnetaddr=172.16.42.2
totem.interface.mcastaddr=239.192.187.168
totem.interface.mcastport=5405
runtime.totem.pg.mrp.srp.orf_token_tx=3
runtime.totem.pg.mrp.srp.orf_token_rx=1103226
runtime.totem.pg.mrp.srp.memb_merge_detect_tx=395
runtime.totem.pg.mrp.srp.memb_merge_detect_rx=1098359
runtime.totem.pg.mrp.srp.memb_join_tx=38
runtime.totem.pg.mrp.srp.memb_join_rx=50
runtime.totem.pg.mrp.srp.mcast_tx=218
runtime.totem.pg.mrp.srp.mcast_retx=0
runtime.totem.pg.mrp.srp.mcast_rx=541
runtime.totem.pg.mrp.srp.memb_commit_token_tx=12
runtime.totem.pg.mrp.srp.memb_commit_token_rx=18
runtime.totem.pg.mrp.srp.token_hold_cancel_tx=49
runtime.totem.pg.mrp.srp.token_hold_cancel_rx=173
runtime.totem.pg.mrp.srp.operational_entered=6
runtime.totem.pg.mrp.srp.operational_token_lost=1
^^^
runtime.totem.pg.mrp.srp.gather_entered=7
runtime.totem.pg.mrp.srp.gather_token_lost=0
runtime.totem.pg.mrp.srp.commit_entered=6
runtime.totem.pg.mrp.srp.commit_token_lost=0
runtime.totem.pg.mrp.srp.recovery_entered=6
runtime.totem.pg.mrp.srp.recovery_token_lost=0
runtime.totem.pg.mrp.srp.consensus_timeouts=1
^^^
runtime.totem.pg.mrp.srp.mtt_rx_token=1727
runtime.totem.pg.mrp.srp.avg_token_workload=62244458
runtime.totem.pg.mrp.srp.avg_backlog_calc=0
runtime.totem.pg.mrp.srp.rx_msg_dropped=0
runtime.totem.pg.mrp.srp.members.2.ip=r(0) ip(172.16.42.2)
runtime.totem.pg.mrp.srp.members.2.join_count=1
runtime.totem.pg.mrp.srp.members.2.status=joined
runtime.totem.pg.mrp.srp.members.1.ip=r(0) ip(172.16.42.1)
runtime.totem.pg.mrp.srp.members.1.join_count=3
runtime.totem.pg.mrp.srp.members.1.status=joined
runtime.blackbox.dump_flight_data=no
runtime.blackbox.dump_state=no
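
(The excerpt above is hand-picked from the full output; to re-check the interesting counters after the next incident I will probably just filter it, e.g.:)

corosync-objctl -a | grep -E 'token_lost|consensus_timeouts|operational_entered|rx_msg_dropped'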

Some questions at this point:
A) why did the cluster lose the token? due to a timeout? token (10000) or consensus (2000)?
B) why did the timeout elapse? maybe that is connected with the answer to A ... ?
C) is it normal that 'token=10000' and 'consensus=2000', although the documentation says the defaults are 'token=1000' and 'consensus=1.2*token'?
D) since I suspect problems with the switches connecting the other interfaces (em3/em4, bonded to bond0) of those machines, I wonder whether any cluster traffic goes that way and not via bond1? (see the check sketched below)
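
Regarding D, my plan is to simply watch both bonds for totem traffic (interface names as described above, port taken from totem.interface.mcastport):

# on node 2: the cluster interconnect should carry the totem traffic
tcpdump -i bond1 -n udp port 5405

# in parallel: nothing should show up on the service bond
tcpdump -i bond0 -n udp port 5405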

As I already stated: the connection of em1/em2 is a direct one without any network element in between.

For now, I want to add the following line to cluster.conf and see whether the situation improves:

<totem token_retransmits_before_loss_const="10" fail_recv_const="100" consensus="12000"/>
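
For completeness, the <totem> element would sit directly under <cluster>, next to <cman> (bumping config_version to 52 is my assumption), and I would validate and distribute it with the usual cman tools:

<cluster config_version="52" name="my-cluster">
    <cman expected_votes="1" two_node="1"/>
    <totem token_retransmits_before_loss_const="10" fail_recv_const="100" consensus="12000"/>
    <!-- ... rest of the existing configuration unchanged ... -->
</cluster>

# validate the edited cluster.conf, then push the new version to both nodes
ccs_config_validate
cman_tool version -r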

Any comment concerning that?

While googling for reasons I have also seen that it can be a problem if the clocks of the two nodes are not synchronized; but in my case ntpd on both nodes uses two stratum 2 NTP servers. I also cannot find anything unusual in the log files, e.g. a jump of multiple seconds, although I have to admit that ntpd does not yet run with debugging enabled.
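
Just to document how I am checking the clocks (plain ntpq, nothing cluster-specific; the ssh call assumes node-to-node ssh works):

# run on both nodes; offset and jitter should stay in the low-millisecond range
ntpq -p

# quick comparison of the wall clocks as seen from node 1
date; ssh df2-clusterlink date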


Thanks in advance for any hint or comment!


Kind regards,

    Heiko


