Re: Totem Process pause detected

"Fabio M. Di Nitto" <fdinitto@xxxxxxxxxx> · Tue, 22 Dec 2015 16:32:57 +0100

On 12/21/2015 10:32 PM, Ludovic Zammit wrote:
> Hello,
> 
> I'm running a CentOS 6.7 cluster of 2 nodes on a Hyper-V hypervisor.
> Every day at 11PM a snapshot job saves both servers.
> The snapshotting process seems to cause a loss of connectivity between
> the two nodes, which results in the cluster partitioning and pacemaker
> starting services on both nodes.
> Then once the snapshotting is done, the two halves of the cluster are
> able to see each other again and pacemaker chooses one on which to run
> the services.
> Unfortunately that means that our DRBD partition has been mounted on
> both, so it now goes into « split brain mode ».
> 
> 
> When I was running corosync 1.4, I used to adjust the « token » variable
> in the configuration file so that both nodes would wait longer before
> detecting a loss of the other.
> 
> Now that I have upgraded to corosync 2 (2.3.5 to be more precise) the
> problem is back with a vengeance.
> 
> I have tried the configuration below, with a very high token value,
> and that resulted in the following errors (I have since reverted that
> change):

It is a bad idea to raise the token timeout that high. It means that
detecting any fault between nodes will take correspondingly long.

> 
> Dec 21 08:59:13 [16696] node1 corosync notice  [TOTEM ] totemsrp.c:783 Process pause detected for 3464149 ms, flushing membership messages.
> Dec 21 08:59:13 [16696] node1 corosync notice  [TOTEM ] totemsrp.c:783 Process pause detected for 3464149 ms, flushing membership messages.
> Dec 21 08:59:13 [16696] node1 corosync notice  [TOTEM ] totemsrp.c:783 Process pause detected for 3464199 ms, flushing membership messages.
> 
> 
> What can I do to prevent the cluster splitting apart during those
> nightly snapshots? 

Either use another backup method, or stop the cluster on the VM you are
about to snapshot, take the snapshot, start the cluster again, and then
move on to the next VM.
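
The one-node-at-a-time procedure above could be scripted roughly as
follows. This is only a sketch under assumptions that are not in the
original message: the cluster stack is controlled through the
"pacemaker" and "corosync" init scripts, the script runs on a host with
key-based ssh access to each node, and snapshot() is a hypothetical
callback that triggers the hypervisor-side snapshot and blocks until it
is done.

```python
import subprocess

# Assumed init-script names on an EL6 node; adjust for your install.
# Pacemaker is stopped before corosync and started after it.
STOP_CMDS = (["service", "pacemaker", "stop"],
             ["service", "corosync", "stop"])
START_CMDS = (["service", "corosync", "start"],
              ["service", "pacemaker", "start"])


def run_on(node, cmd):
    """Run cmd on node over ssh (assumes key-based ssh access)."""
    subprocess.run(["ssh", node, *cmd], check=True)


def rolling_snapshot(nodes, snapshot, run=run_on):
    """Stop the cluster on one node, snapshot it, restart, then move on.

    snapshot is a hypothetical callable taking the node name; run is
    injectable so the sequence can be checked without touching a node.
    """
    for node in nodes:
        for cmd in STOP_CMDS:
            run(node, cmd)
        try:
            snapshot(node)
        finally:
            # Bring the cluster stack back even if the snapshot failed.
            for cmd in START_CMDS:
                run(node, cmd)
```

In practice you would also want to wait until the restarted node has
rejoined the cluster before moving on to the next one; that check is
left out of the sketch.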

> How do I manually set a long totem timeout without breaking everything else?
> 

The problem is not just the token timeout. The problem is that the VM
was frozen for at least 3464199 ms without being scheduled by the
hypervisor. So even a very high token timeout would not solve the
problem of services running on that specific VM NOT being available
during the snapshot.
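
To put the numbers side by side, here is a small sketch that pulls the
pause durations out of log lines like the ones quoted above and compares
them with the configured token value (150000 ms in the posted
configuration). The helper names are made up for illustration. The
longest pause works out to roughly 57.7 minutes, about 23 times the
token timeout.

```python
import re

# Matches the duration in corosync's "Process pause detected" notice.
PAUSE_RE = re.compile(r"Process pause detected for (\d+) ms")


def pause_durations_ms(log_text):
    """Return every reported pause duration, in milliseconds."""
    return [int(ms) for ms in PAUSE_RE.findall(log_text)]


def pause_vs_token(pause_ms, token_ms):
    """Return (pause in minutes, pause as a multiple of the token)."""
    return pause_ms / 60000, pause_ms / token_ms


if __name__ == "__main__":
    log = (
        "[TOTEM ] totemsrp.c:783 Process pause detected for 3464149 ms\n"
        "[TOTEM ] totemsrp.c:783 Process pause detected for 3464199 ms\n"
    )
    longest = max(pause_durations_ms(log))
    minutes, ratio = pause_vs_token(longest, token_ms=150000)
    print(f"{longest} ms = {minutes:.1f} min, {ratio:.1f}x the token")
```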

Fabio

> 
> 
> 
> ======================================================================
> 
> Software version:
> 2.6.32-573.7.1.el6.x86_64
> 
> corosync-2.3.5-1.el6.x86_64
> corosynclib-2.3.5-1.el6.x86_64
> 
> pacemaker-cluster-libs-1.1.13-1.el6.x86_64
> pacemaker-cli-1.1.13-1.el6.x86_64
> 
> kmod-microsoft-hyper-v-4.0.11-20150728.x86_64
> microsoft-hyper-v-4.0.11-20150728.x86_64
> 
> Configuration:
> 
> totem {
>     version: 2
> 
>     crypto_cipher: none
>     crypto_hash: none
>     clear_node_high_bit: yes
>     cluster_name: cluster
>     transport: udpu
>     token: 150000
> 
>     interface {
>         ringnumber: 0
>         bindnetaddr: 10.200.0.2
>         mcastport: 5405
>         ttl: 1
>     }
> }
> 
> nodelist {
>     node {
>         ring0_addr:  10.200.0.2
>     }
> 
>     node {
>         ring0_addr:  10.200.0.3
>     }
> }
> 
> logging {
>     fileline: on
>     to_stderr: no
>     to_logfile: yes
>     logfile: /var/log/cluster/corosync.log
>     to_syslog: yes
>     debug: off
>     timestamp: on
>     logger_subsys {
>         subsys: QUORUM
>         debug: off
>     }
> }
> 
> 
> quorum {
>     provider: corosync_votequorum
>     two_node: 1
> }
> 
> 
> 
> Thank you for your help,
> — 
> 
> Ludovic Zammit
> 
> 
> _______________________________________________
> discuss mailing list
> discuss@xxxxxxxxxxxx
> http://lists.corosync.org/mailman/listinfo/discuss
> 