Re: Corosync + DRBD and network glitch

Digimer <lists@xxxxxxxxxx> · Wed, 22 Jan 2014 22:27:20 -0500

No worries. Here's the definition from the kernel docs[1]:

============
mode

	Specifies one of the bonding policies. The default is
	balance-rr (round robin).  Possible values are:

	balance-rr or 0

		Round-robin policy: Transmit packets in sequential
		order from the first available slave through the
		last.  This mode provides load balancing and fault
		tolerance.

	active-backup or 1

		Active-backup policy: Only one slave in the bond is
		active.  A different slave becomes active if, and only
		if, the active slave fails.  The bond's MAC address is
		externally visible on only one port (network adapter)
		to avoid confusing the switch.

		In bonding version 2.6.2 or later, when a failover
		occurs in active-backup mode, bonding will issue one
		or more gratuitous ARPs on the newly active slave.
		One gratuitous ARP is issued for the bonding master
		interface and each VLAN interfaces configured above
		it, provided that the interface has at least one IP
		address configured.  Gratuitous ARPs issued for VLAN
		interfaces are tagged with the appropriate VLAN id.

		This mode provides fault tolerance.  The primary
		option, documented below, affects the behavior of this
		mode.

	balance-xor or 2

		XOR policy: Transmit based on the selected transmit
		hash policy.  The default policy is a simple [(source
		MAC address XOR'd with destination MAC address) modulo
		slave count].  Alternate transmit policies may be
		selected via the xmit_hash_policy option, described
		below.

		This mode provides load balancing and fault tolerance.
============

1. https://www.kernel.org/doc/Documentation/networking/bonding.txt
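
To make the three policies quoted above concrete, here is a minimal
Python sketch of how each one might pick a slave for an outgoing frame.
It is a toy model under stated assumptions, not the kernel's
implementation: the function names (balance_rr, active_backup,
balance_xor), the link_up dictionary and the interface names are all
hypothetical, and the XOR hash treats each MAC address as a 48-bit
integer, which only approximates the "(source MAC XOR destination MAC)
modulo slave count" wording in the docs.

```python
from itertools import count


def mac_to_int(mac: str) -> int:
    """Convert 'aa:bb:cc:dd:ee:ff' to an integer (illustrative helper)."""
    return int(mac.replace(":", ""), 16)


def balance_rr(slaves):
    """Mode 0: yield slaves in sequential order, wrapping around forever."""
    for i in count():
        yield slaves[i % len(slaves)]


def active_backup(slaves, link_up, current):
    """Mode 1: keep the current slave while its link is up.

    Only when the active slave has failed do we pick the first slave
    whose link is still up. Returns None if every link is down.
    """
    if link_up[current]:
        return current
    for slave in slaves:
        if link_up[slave]:
            return slave
    return None


def balance_xor(src_mac, dst_mac, slaves):
    """Mode 2, default policy as quoted: (src XOR dst) modulo slave count."""
    index = (mac_to_int(src_mac) ^ mac_to_int(dst_mac)) % len(slaves)
    return slaves[index]


if __name__ == "__main__":
    slaves = ["eth0", "eth1"]  # hypothetical interface names

    rr = balance_rr(slaves)
    print([next(rr) for _ in range(4)])

    # eth1 is active and has just failed, so the bond moves to eth0.
    print(active_backup(slaves, {"eth0": True, "eth1": False}, "eth1"))

    print(balance_xor("00:00:00:00:00:01", "00:00:00:00:00:02", slaves))
```

In this model a given source/destination MAC pair always maps to the
same slave under balance-xor, so a single peer-to-peer stream is not
spread across both links, whereas balance-rr alternates links packet by
packet.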

On 22/01/14 03:20 PM, Francois Gaudreault wrote:
Sorry for my ignorance, but what do you mean by mode 0 or 2?

FG

On 1/22/2014, 3:05 PM, Digimer wrote:
I know that support for mode=0 and mode=2 was added recently; maybe
they would work better?

On 22/01/14 03:02 PM, Francois Gaudreault wrote:
Well, LACP is at the hypervisor level, so Corosync just sees a standard
interface.

Active/passive is not really an option for us; we need the 2 Gbps of
bandwidth. Are there any timeouts you think we can tweak?

FG

On 1/22/2014, 2:33 PM, Digimer wrote:
On 22/01/14 12:50 PM, Francois Gaudreault wrote:
Hi all,

I don't know if this has been addressed before, but I couldn't find
anything in a quick search.

We have a corosync cluster to manage an active/passive MySQL service
with DRBD underneath. Those two servers are in fact VMs running on top
of two different XenServer hypervisors. The hypervisors are connected
with an LACP active-active link to a stacked switch.

What's happening is that if we reboot a stack unit, LACP takes some
time to move the established sessions to the other link. This short
glitch is long enough to trigger a lost member in Corosync. You can
guess the rest: both nodes become master, and when the network comes
back, DRBD split-brains.

Is there anything we can do to tolerate such failures, which last
around 20 to 30 seconds?

Last I checked, corosync didn't support LACP. In my network and switch
failure tests (with both corosync and DRBD running), mode=1
(active/passive) was the only mode I found to reliably survive all
failure and recovery scenarios (including power-cycling switches, etc.).

It could be that your switch is temporarily blocking all traffic to
check STP. You might want to try disabling STP and re-running your
tests.

Also, if you have fencing set up properly, you won't get a split-brain
regardless.

--
Digimer
Papers and Projects: https://alteeve.ca/w/
What if the cure for cancer is trapped in the mind of a person without 
access to education?