Hello,

I'm running a two-node CentOS 6.7 cluster on a Hyper-V hypervisor. Every day at 11 PM a snapshot job saves both servers. The snapshotting process seems to cause a loss of connectivity between the two nodes, which results in the cluster partitioning and Pacemaker starting services on both nodes. Once the snapshotting is done, the two halves of the cluster can see each other again and Pacemaker chooses one node on which to run the services. Unfortunately, by then our DRBD partition has been mounted on both nodes, so it goes into "split brain" mode.

When I was running corosync 1.4, I used to adjust the "token" variable in the configuration file so that both nodes would wait longer before declaring the other one lost. Now that I have upgraded to corosync 2 (2.3.5, to be more precise), the problem is back with a vengeance.

I have tried the configuration below, with a very high token value, and that resulted in the following errors (I have since reverted that change):

Dec 21 08:59:13 [16696] node1 corosync notice [TOTEM ] totemsrp.c:783 Process pause detected for 3464149 ms, flushing membership messages.
Dec 21 08:59:13 [16696] node1 corosync notice [TOTEM ] totemsrp.c:783 Process pause detected for 3464149 ms, flushing membership messages.
Dec 21 08:59:13 [16696] node1 corosync notice [TOTEM ] totemsrp.c:783 Process pause detected for 3464199 ms, flushing membership messages.

What can I do to prevent the cluster from splitting apart during those nightly snapshots? How do I manually set a long totem token timeout without breaking everything else?

======================================================================

Software version:

2.6.32-573.7.1.el6.x86_64
corosync-2.3.5-1.el6.x86_64
corosynclib-2.3.5-1.el6.x86_64
pacemaker-cluster-libs-1.1.13-1.el6.x86_64
pacemaker-cli-1.1.13-1.el6.x86_64
kmod-microsoft-hyper-v-4.0.11-20150728.x86_64
microsoft-hyper-v-4.0.11-20150728.x86_64

Configuration:

totem {
    version: 2
    crypto_cipher: none
    crypto_hash: none
    clear_node_high_bit: yes
    cluster_name: cluster
    transport: udpu
    token: 150000

    interface {
        ringnumber: 0
        bindnetaddr: 10.200.0.2
        mcastport: 5405
        ttl: 1
    }
}

nodelist {
    node {
        ring0_addr: 10.200.0.2
    }
    node {
        ring0_addr: 10.200.0.3
    }
}

logging {
    fileline: on
    to_stderr: no
    to_logfile: yes
    logfile: /var/log/cluster/corosync.log
    to_syslog: yes
    debug: off
    timestamp: on

    logger_subsys {
        subsys: QUORUM
        debug: off
    }
}

quorum {
    provider: corosync_votequorum
    two_node: 1
}

Thank you for your help,

Ludovic Zammit