Totem Process pause detected

Ludovic Zammit <lzammit@xxxxxxxxxx> · Mon, 21 Dec 2015 16:32:48 -0500

Hello,

I'm running a CentOS 6.7 cluster of 2 nodes on a Hyper-V hypervisor.
Every day at 11 PM a snapshot job saves both servers.
The snapshotting process seems to cause a loss of connectivity between the two nodes, which results in the cluster partitioning and pacemaker starting services on both nodes.
Then, once the snapshotting is done, the two halves of the cluster are able to see each other again and pacemaker chooses one node on which to run the services.
Unfortunately, that means our DRBD partition has been mounted on both nodes, so it now goes into « split brain » mode.

When I was running corosync 1.4, I used to adjust the « token » variable in the configuration file so that both nodes would wait longer before detecting a loss of the other.

Now that I have upgraded to corosync 2 (2.3.5 to be more precise) the problem is back with a vengeance.

I have tried the configuration below, with a very high token value, and that resulted in the following errors (I have since reverted that change):

Dec 21 08:59:13 [16696] node1 corosync notice  [TOTEM ] totemsrp.c:783 Process pause detected for 3464149 ms, flushing membership messages.
Dec 21 08:59:13 [16696] node1 corosync notice  [TOTEM ] totemsrp.c:783 Process pause detected for 3464149 ms, flushing membership messages.
Dec 21 08:59:13 [16696] node1 corosync notice  [TOTEM ] totemsrp.c:783 Process pause detected for 3464199 ms, flushing membership messages.
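
For scale, here is a small Python sketch that pulls the pause duration out of lines like the ones above and converts it to minutes and seconds. The helper names (pause_durations_ms, human) are made up for this illustration and are not part of corosync; the sample line is copied from the log excerpt.

```python
import re

# Matches the duration in corosync's "Process pause detected" notice.
PAUSE_RE = re.compile(r"Process pause detected for (\d+) ms")

def pause_durations_ms(log_text):
    """Return every reported pause duration, in milliseconds."""
    return [int(ms) for ms in PAUSE_RE.findall(log_text)]

def human(ms):
    """Format milliseconds as 'Mm S.SSSs' for quick reading."""
    minutes, rem_ms = divmod(ms, 60_000)
    return f"{minutes}m {rem_ms / 1000:.3f}s"

# Sample line taken from the log excerpt in this message.
sample = (
    "Dec 21 08:59:13 [16696] node1 corosync notice  [TOTEM ] totemsrp.c:783 "
    "Process pause detected for 3464149 ms, flushing membership messages."
)

for ms in pause_durations_ms(sample):
    print(human(ms))  # prints "57m 44.149s"
```

So the reported pause is close to an hour, far longer than the 150000 ms token in the configuration further down.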

What can I do to prevent the cluster splitting apart during those nightly snapshots?
How do I manually set a long token timeout without breaking everything else?

======================================================================

Software versions:
kernel 2.6.32-573.7.1.el6.x86_64

corosync-2.3.5-1.el6.x86_64
corosynclib-2.3.5-1.el6.x86_64

pacemaker-cluster-libs-1.1.13-1.el6.x86_64
pacemaker-cli-1.1.13-1.el6.x86_64

kmod-microsoft-hyper-v-4.0.11-20150728.x86_64
microsoft-hyper-v-4.0.11-20150728.x86_64

Configuration:

totem {
    version: 2

    crypto_cipher: none
    crypto_hash: none
    clear_node_high_bit: yes
    cluster_name: cluster
    transport: udpu
    token: 150000

    interface {
        ringnumber: 0
        bindnetaddr: 10.200.0.2
        mcastport: 5405
        ttl: 1
    }
}

nodelist {
    node {
        ring0_addr:  10.200.0.2
    }

    node {
        ring0_addr:  10.200.0.3
    }
}

logging {
    fileline: on
    to_stderr: no
    to_logfile: yes
    logfile: /var/log/cluster/corosync.log
    to_syslog: yes
    debug: off
    timestamp: on
    logger_subsys {
        subsys: QUORUM
        debug: off
    }
}

quorum {
    provider: corosync_votequorum
    two_node: 1
}
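
To make explicit what the token value above implies for the related timeouts, here is a rough Python sketch. It assumes the defaults as I read them in corosync.conf(5): consensus defaults to 1.2 * token, and with a nodelist the token is increased by token_coefficient (650 ms by default) for each node beyond the second. The function names are made up for this illustration, and the real values are best checked on the running cluster.

```python
def effective_token_ms(token_ms, nodes, token_coefficient_ms=650):
    """Token timeout actually used when a nodelist is present (assumed formula)."""
    extra_nodes = max(nodes - 2, 0)  # coefficient only applies from the third node on
    return token_ms + extra_nodes * token_coefficient_ms

def default_consensus_ms(token_ms):
    """Assumed default consensus timeout: 1.2 * token, in whole milliseconds."""
    return token_ms * 12 // 10

# Values from the configuration above: token 150000, two nodes.
token = effective_token_ms(150000, nodes=2)
print(token, default_consensus_ms(token))  # prints "150000 180000"
```

With two nodes the coefficient adds nothing, so the configuration above would give a 150 second token and, if consensus is left unset, a 180 second consensus timeout.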

Thank you for your help,
-- 
Ludovic Zammit
