On 21/12/15 04:32 PM, Ludovic Zammit wrote:
> Hello,
>
> I'm running a CentOS 6.7 cluster of 2 nodes on a Hyper-V hypervisor.
> Every day at 11PM a snapshot job saves both servers.
> The snapshotting process seems to cause a loss of connectivity between
> the two nodes, which results in the cluster partitioning and pacemaker
> starting services on both nodes.

You should have stonith enabled, configured and tested.

> Then once the snapshotting is done, the two halves of the cluster are
> able to see each other again and pacemaker chooses one on which to run
> the services.
> Unfortunately that means that our DRBD partition has been mounted on
> both, so it now goes into "split brain mode".

Hook DRBD's fencing into pacemaker's with the crm-{un,}fence-peer.sh
{un,}fence handlers and set fencing to 'resource-and-stonith'. This will
prevent split-brains, regardless of the root cause.

> When I was running corosync 1.4, I used to adjust the "token" variable
> in the configuration file so that both nodes would wait longer before
> detecting a loss of the other.
>
> Now that I have upgraded to corosync 2 (2.3.5 to be more precise), the
> problem is back with a vengeance.

This is not a supported configuration on EL6, so I'm not surprised that
you're seeing issues. In any case, fix stonith first and foremost. Then
sort out the reason for corosync blocking.

> I have tried the configuration below, with a very high totem value,
> and that resulted in the following errors (I have since reverted that
> change):
>
> Dec 21 08:59:13 [16696] node1 corosync notice [TOTEM ] totemsrp.c:783
> Process pause detected for 3464149 ms, flushing membership messages.
> Dec 21 08:59:13 [16696] node1 corosync notice [TOTEM ] totemsrp.c:783
> Process pause detected for 3464149 ms, flushing membership messages.
> Dec 21 08:59:13 [16696] node1 corosync notice [TOTEM ] totemsrp.c:783
> Process pause detected for 3464199 ms, flushing membership messages.
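Note that "Process pause detected" means the corosync process itself was
frozen for that whole interval (3464149 ms is nearly an hour; the VM was
presumably paused while the snapshot ran), so no token timeout will paper
over a pause that long. If you still want a longer detection window for
short stalls, the relevant totem options in corosync 2 look roughly like
this (a sketch only; the values are illustrative assumptions, not
recommendations):

    totem {
        # milliseconds to wait for the token before declaring token loss
        token: 10000
        # retransmit attempts before the token is considered lost
        token_retransmits_before_loss_const: 10
        # must be larger than token; corosync derives 1.2 * token
        # if it is left unset
        consensus: 12000
    }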
> What can I do to prevent the cluster splitting apart during those
> nightly snapshots?
> How do I manually set a long totem timeout without breaking everything
> else?

Snapshots are generally a poor way to handle backups. The images created
are point-in-time copies, *without* whatever was in RAM. So recovering
from one would effectively be like recovering from a sudden power loss.
*Usually* OK, but if something goes so wrong that you need to recover
from backup, "usually" isn't good enough. So before anything, I would
reconsider the snapshots entirely.

> ======================================================================
>
> Software version:
> 2.6.32-573.7.1.el6.x86_64
>
> corosync-2.3.5-1.el6.x86_64
> corosynclib-2.3.5-1.el6.x86_64
>
> pacemaker-cluster-libs-1.1.13-1.el6.x86_64
> pacemaker-cli-1.1.13-1.el6.x86_64
>
> kmod-microsoft-hyper-v-4.0.11-20150728.x86_64
> microsoft-hyper-v-4.0.11-20150728.x86_64
>
> Configuration:
>
> totem {
>     version: 2
>
>     crypto_cipher: none
>     crypto_hash: none
>     clear_node_high_bit: yes
>     cluster_name: cluster
>     transport: udpu
>     token: 150000
>
>     interface {
>         ringnumber: 0
>         bindnetaddr: 10.200.0.2
>         mcastport: 5405
>         ttl: 1
>     }
> }
>
> nodelist {
>     node {
>         ring0_addr: 10.200.0.2
>     }
>
>     node {
>         ring0_addr: 10.200.0.3
>     }
> }
>
> logging {
>     fileline: on
>     to_stderr: no
>     to_logfile: yes
>     logfile: /var/log/cluster/corosync.log
>     to_syslog: yes
>     debug: off
>     timestamp: on
>     logger_subsys {
>         subsys: QUORUM
>         debug: off
>     }
> }
>
> quorum {
>     provider: corosync_votequorum
>     two_node: 1
> }
>
> Thank you for your help,
> --
>
> Ludovic Zammit
>
> _______________________________________________
> discuss mailing list
> discuss@xxxxxxxxxxxx
> http://lists.corosync.org/mailman/listinfo/discuss

--
Digimer
Papers and Projects: https://alteeve.ca/w/
What if the cure for cancer is trapped in the mind of a person without
access to education?
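P.S.: in case it helps, the DRBD fencing hook-up mentioned above normally
lives in the resource definition. A sketch (the resource name "r0" is a
placeholder, and the handler paths are the usual package locations;
verify them on your install):

    resource r0 {
        net {
            # freeze I/O and call the fence-peer handler when the peer
            # is lost, instead of carrying on alone
            fencing resource-and-stonith;
        }
        handlers {
            # adds a pacemaker constraint keeping the service off the
            # stale peer
            fence-peer "/usr/lib/drbd/crm-fence-peer.sh";
            # removes that constraint once the peer is back in sync
            after-resync-target "/usr/lib/drbd/crm-unfence-peer.sh";
        }
    }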
_______________________________________________
discuss mailing list
discuss@xxxxxxxxxxxx
http://lists.corosync.org/mailman/listinfo/discuss