Grant Taylor <gtaylor@xxxxxxxxxxxxxxxxx> wrote:
>On 07/31/07 06:01, Ralf Gross wrote:
>> But I don't have an isolated network. Maybe I'm still too blind to see a
>> simple solution.

There really isn't a simple solution, since you're not doing something simple. It sounds simple to say you want to aggregate bandwidth from multiple interfaces for use by one TCP connection, but it's actually a pretty complicated problem to solve.

The diagram and description in the bonding documentation describing the isolated network are really meant for use in clusters, and are more historical than anything else these days. In the days of yore, it was fairly cost effective to connect several switches to several systems such that each system had one port into each switch (as opposed to buying a single, much larger, switch). With no packet coalescing or the like, balance-rr would tend to deliver packets in order to the end systems (one packet per interrupt), and a given connection could get pretty close to full striped throughput. This type of arrangement breaks down with modern network hardware, since there is no longer a one-to-one relationship between interrupts and packet arrival.

>The fact that you are trying to go across an aggregated link in the middle
>between the two buildings where you have no control is going to hinder you
>severely.

Yes. You're also running up against the fact that, traditionally, Etherchannel (and its equivalents) is meant to aggregate trunks, optimizing for maximum overall throughput across many connections. It's not really designed to let a single connection effectively use the combined bandwidth of multiple links.

>The only other nasty thing that comes to mind is to assign additional MAC
>/ IP sets to each system on their second interfaces.

Another similar Rube Goldberg sort of scheme I've set up in the past (in the lab, for bonding testing, not in a production environment, your mileage may vary, etc, etc) is to dedicate particular switch ports to particular vlans. So, e.g.,

    linux box  eth0 ---- port 1: vlan 99  SWITCH(ES)  port 2: vlan 99 ---- eth0  linux box
      bond0    eth1 ---- port 3: vlan 88  SWITCH(ES)  port 4: vlan 88 ---- eth1    bond0

This sort of arrangement requires setting the Cisco switch ports to be native to a particular vlan, e.g., "switchport mode access", "switchport access vlan 88". In theory, the intervening switches will simply pass the vlan traffic through and not decapsulate it until it reaches its final destination port. You might also have to fool with the inter-switch links to make sure they're trunking properly (so that they pass the vlan traffic).

The downside of this scheme is that the bond0 instances can only communicate with each other, unless one of the intermediate switches can route between the vlans and the regular network, or some other host is also attached to the vlans to act as a gateway to the rest of the network. My switches won't route, since they're switch-only models (2960/2970/3550) with no layer 3 capability, and I've never tried setting up a separate gateway host in such a configuration.

This also won't work if the intervening switches either (a) don't have higher capacity inter-switch links, or (b) don't spread the traffic across the ISLs any better than they do on a regular etherchannel. Basically, you want to take the switches out of the equation, so the load balance algorithm used by etherchannel doesn't disturb the even balance of the round robin transmission.
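For what it's worth, the Linux end of that scheme is just an ordinary balance-rr bond. A minimal sketch might look something like the following; the interface names, the address, and the miimon value are only examples, and most distros would put this in their own network configuration files rather than have you run the commands by hand:

    # load the bonding driver in round-robin mode with basic link monitoring
    modprobe bonding mode=balance-rr miimon=100

    # give the bond an address and bring it up (example subnet only)
    ip addr add 10.0.99.1/24 dev bond0
    ip link set bond0 up

    # enslave the two physical interfaces that go to the dedicated vlan ports
    ifenslave bond0 eth0 eth1

The box on the other end gets the same setup with a different address on bond0.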
There might be other ways to essentially tunnel from port 1 to port 2 and from port 3 to port 4 (in my diagram above), but that's really what you're looking to do.

Lastly, as long as I'm here, I can give my usual commentary about TCP packet reordering. The bonding balance-rr mode will generally deliver packets out of order to an aggregated destination (if you feed a balance-rr bond of N links at speed X into a single link with enough capacity to handle N * X bandwidth, you don't see this problem). This is ignoring any port assignment a switch might do. TCP's reaction to receiving segments out of order is typically to issue duplicate ACKs indicating a lost segment; by default, three segments arriving out of order are enough to trigger a fast retransmit. On Linux, this threshold can be adjusted via the net.ipv4.tcp_reordering sysctl. Crank it up to 127 or so and the reordering effect is minimized, although there are other congestion control effects.

The bottom line is that you won't ever see N * X bandwidth on a single TCP connection, and the return per added link falls off as the number of links in the aggregate increases. With four links, you're doing pretty well to get about 2.3 links' worth of throughput. If memory serves, with two links you top out around 1.5.

So the real question is: since you've got two links, how important is that extra 0.5 link's worth of transfer speed? Can you instead figure out a way to split your backup problem into pieces and run them concurrently? That can be a much easier problem to tackle, given that it's trivial to add extra IP addresses to the hosts on each end, and presumably your higher end Cisco gear will permit a load-balance algorithm other than a straight MAC address XOR. E.g., the 2960 I've got handy permits:

    slime(config)#port-channel load-balance ?
      dst-ip       Dst IP Addr
      dst-mac      Dst Mac Addr
      src-dst-ip   Src XOR Dst IP Addr
      src-dst-mac  Src XOR Dst Mac Addr
      src-ip       Src IP Addr
      src-mac      Src Mac Addr

so it's possible to get the IP address into the port selection math, and adding IP addresses is pretty straightforward (a quick sketch of the Linux side is at the end of this note).

	-J

---
	-Jay Vosburgh, IBM Linux Technology Center, fubar@xxxxxxxxxx
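P.S. Roughly what the Linux-side pieces mentioned above look like; the interface name and address here are only examples, so substitute your own:

    # raise TCP's reordering tolerance so round-robin striping doesn't keep
    # triggering fast retransmit (127 is the ballpark value mentioned above)
    sysctl -w net.ipv4.tcp_reordering=127

    # add a second IP address to an interface so two concurrent backup streams
    # can hash onto different links (address and interface name are made up)
    ip addr add 10.0.1.2/24 dev eth0

On the switch side you'd then pick one of the IP-based options from the port-channel load-balance list above, so that the extra addresses actually influence link selection.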