On 22/01/14 12:50 PM, Francois Gaudreault wrote:
> Hi all,
> I don't know if this has been addressed before, but I couldn't find
> anything quickly.
> We have a corosync cluster managing an active/passive MySQL service
> with DRBD underneath. The two servers are in fact VMs running on top
> of two different XenServer hypervisors. The hypervisors are connected
> with an LACP active-active link to a stacked switch.
> What happens is that if we reboot a stack unit, LACP takes some
> time to move the established sessions to the other link. This short
> glitch is long enough to trigger a member loss in Corosync. You can guess
> the rest: both nodes become master, and when the network is back, DRBD
> split-brains.
> Is there anything we can do to tolerate such failures, which last around
> 20 to 30 seconds?
Last I checked, corosync didn't support LACP. In my network/switch
failure tests (with both corosync and DRBD running), bonding mode=1
(active/passive) was the only mode I found to reliably survive all
failure and recovery scenarios (including power-cycling switches, etc.).
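To confirm which mode a bond is actually running in, here is a minimal Python sketch. It assumes the Linux bonding driver is in use and exposes its state under /proc/net/bonding/<bond>. A hypervisor using a different network stack will not have that file. The function names, the bond0 path, and the abbreviated SAMPLE text are illustrative assumptions, not taken from your setup.

```python
import re
from typing import Optional

# Abbreviated, illustrative excerpt of /proc/net/bonding/<bond> for a
# mode=1 bond. A real file contains more fields than shown here.
SAMPLE = """\
Bonding Mode: fault-tolerance (active-backup)
Currently Active Slave: eth0
MII Status: up
"""


def bonding_mode(proc_text: str) -> Optional[str]:
    """Return the text after 'Bonding Mode:', or None if the line is absent."""
    match = re.search(r"^Bonding Mode:\s*(.+)$", proc_text, re.MULTILINE)
    return match.group(1).strip() if match else None


def is_active_backup(proc_text: str) -> bool:
    """True if the reported mode is active-backup (mode=1)."""
    mode = bonding_mode(proc_text)
    return mode is not None and "active-backup" in mode


def read_bond(path: str = "/proc/net/bonding/bond0") -> str:
    """Read the bond status file. The default path is an assumption."""
    with open(path, encoding="ascii", errors="replace") as handle:
        return handle.read()


print(bonding_mode(SAMPLE))
print(is_active_backup(SAMPLE))
```

On a real host you would pass read_bond() to is_active_backup() instead of SAMPLE. An LACP bond reports an 802.3ad mode string instead, so the check returns False there.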
It could be that your switch is temporarily blocking all traffic to
check STP. You might want to try disabling STP and re-running your tests.
Also, if you have fencing set up properly, you won't get a split-brain
regardless.
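On the 20 to 30 second window itself: the knob usually involved is the totem token timeout in corosync.conf. As far as I know, `token` is given in milliseconds and `consensus` defaults to 1.2 * token when it is not set. Below is a rough Python sketch of the arithmetic only, not a recommendation. The 150% safety margin and both function names are my own assumptions. Keep in mind that a token timeout this large also delays detection of a real node failure by the same amount, so fixing the network path is likely the better cure.

```python
def min_token_ms(outage_s: int, margin_pct: int = 150) -> int:
    """Token timeout (ms) that outlasts an outage of outage_s seconds.

    margin_pct is a hypothetical safety margin (150 means 1.5x the outage).
    """
    outage_ms = outage_s * 1000
    # Integer ceiling division avoids floating-point rounding.
    return (outage_ms * margin_pct + 99) // 100


def default_consensus_ms(token_ms: int) -> int:
    """Consensus timeout (ms) assuming the 1.2 * token default, rounded up."""
    return (token_ms * 12 + 9) // 10


token = min_token_ms(30)
print("token (ms):", token)
print("consensus if left unset (ms):", default_consensus_ms(token))
```

The resulting value would go in the `token` setting of the totem section. Test it against a real stack-unit reboot before trusting it.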
--
Digimer
Papers and Projects: https://alteeve.ca/w/
What if the cure for cancer is trapped in the mind of a person without
access to education?