TEQL for bonding Multi Gbit Ethernet in a cluster?

Hello,


Can I use TEQL to aggregate multiple Gbit Ethernet links in a multiple-switch 
topology across multiple hosts?
In my example: 17 hosts, each having 6 Gbit Ethernet cards.

Has anybody tried, and maybe even documented, such an approach?

I tried layer 2 bonding as described here
http://www.linuxfoundation.org/collaborate/workgroups/networking/bonding#Maximum_Throughput_in_a_Multiple_Switch_Topology
but I am struggling with disappointing performance gains, a misbehaving 
switch layer and problems during PXE/DHCP boot.

Googling for a more controllable, all-Linux, maybe layer 3 alternative, I 
encountered LARTC.
I think multipath routing as in chapter 4.2.2 does not solve my problem, as I 
want to share bandwidth for single large transfers, too.

I'd like to try the TEQL approach of chapter 10, but there are some open 
questions:

- What does the routing look like if I have 17 hosts connected by 6 interfaces 
each?

I think I cannot use the /31 net approach on a 1-to-1 basis, since I have 17 
machines on each subnet.
Can I use /27 nets instead, allowing 30 hosts per subnet? (A sketch of what I 
mean follows below.)

Or do I need a /31 subnet for each pair of machines on each switch device,
which would be a total of (17 x 16 / 2) * 6 = 816 /31 subnets?
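
To make the /27 variant concrete, a per-VC-module plan could look roughly like 
this (the 10.0.x.y numbering is purely made up for illustration; N is the host 
number, 1..17):

N=3                                    # example: this is host number 3
# one /27 per VC module, host N takes .N on each segment
ip addr add 10.0.0.$N/27 dev eth0      # segment behind VC module 1
ip addr add 10.0.1.$N/27 dev eth1      # segment behind VC module 2
# ... eth2 to eth4 accordingly ...
ip addr add 10.0.5.$N/27 dev eth5      # segment behind VC module 6
ip addr add 10.0.10.$N/32 dev teql0    # the per-host TEQL address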

Is this idea correct:
- one IP address for teql0 and one IP each for eth0 ... eth5 on each host,
	which equals 7 x 17 = 119 IP addresses in total
- a route for each target on each physical interface of each host, pointing to 
the counterpart on the same subnet, like

route add -host <teql-IP-on-target> gw <matching-dev-IP-on-target>

This still adds up to 16 peers x 6 interfaces = 96 routes on each host. 
How does this affect performance?
Of course I can script this, but is there a more "elegant" way, like 
calculated / OR-ed filter addresses?
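
To make the scripting question concrete, this is roughly the generator I have 
in mind (it only prints the commands; numbering as in the hypothetical /27 
plan above):

#!/bin/sh
# print the 16 peers x 6 segments = 96 route commands for this host
MYID=3                                  # this host's number, 1..17
for PEER in $(seq 1 17); do
    [ "$PEER" -eq "$MYID" ] && continue
    for SEG in 0 1 2 3 4 5; do
        # peer's teql0 address, reached via the peer's address on segment SEG;
        # as written, six routes to the same /32 would collide, so they would
        # need distinct metrics, or a single route via teql0, which is exactly
        # the part I am unsure about
        echo "ip route add 10.0.10.$PEER/32 via 10.0.$SEG.$PEER dev eth$SEG"
    done
done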

- Can I continue to use the physical links directly, particularly for 
PXE booting?

- Can I keep the switch configuration as one large network and let ARP / layer 
3 sort out the details, or is it necessary/advantageous to configure all 
layer 3 subnets as separate layer 2 VLANs as well?
Or do I even need 816 VLANs for 816 /31 subnets on a peer-to-peer basis?

- The clients run diskless on nfsroot, which is set up by the dracut boot 
process.
So either I have to establish the whole TEQL setup within dracut during boot, 
or I have to reconfigure the network after boot without dropping the running 
nfsroot. Is this possible?
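
For reference, the per-host setup that would have to be replayed either inside 
dracut or after the nfsroot is up would be roughly the following, following 
the chapter 10 example and again using the made-up numbering from above:

# load the TEQL scheduler and hang all six NICs below one teql device
modprobe sch_teql
for i in 0 1 2 3 4 5; do
    tc qdisc add dev eth$i root teql0
done
ip link set dev teql0 up
ip addr add 10.0.10.$N/32 dev teql0      # N = this host's number, as above
# chapter 10 also says return path filtering must be relaxed on the slaves
for i in 0 1 2 3 4 5; do
    echo 0 > /proc/sys/net/ipv4/conf/eth$i/rp_filter
done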

- I only find reports and advice for 2.x kernels in the list archives.
Are there any advances on the TCP tuning issues in recent kernels?

- Can I expect a performance gain at all, or will the additional CPU overhead 
outweigh the gain in bandwidth?

- What are the recommended tools for testing and tuning?

=================================================

What I have done so far:

I'm just going to build a "poor man's Beowulf cluster" from a bunch of used 
server parts, sourced on eBay.

So I end up with an HP blade center with 16 blade servers in it, each equipped 
with 6 x 1 Gbit Ethernet ports.
They are linked by HP Virtual Connect ("VC") switch units, in such a way that 
there are 6 VC modules, each with one port to every one of the blade servers.
This mapping is hardwired by the blade center design.
The VC is administered and advertised as one large manageable switch, but 
with caveats, see below.

The whole thing is connected to the outside world via a consumer-grade PC 
acting as gateway and file server, with 2 x 4 = 8 Gbit Ethernet ports on the 
cluster side.

All boxes run Debian wheezy, with a vanilla 3.19.0 kernel on the gateway and 
Debian's 3.16.7-ckt4-3~bpo70+1 on the blades. 
The blades are booted over DHCP/PXE/TFTP/nfsroot.

Of course I would like to utilize the full available network bandwidth for 
interprocess communication.

My first try was Linux bonding with the 802.3ad bonding policy; see
http://www.linuxfoundation.org/collaborate/workgroups/networking/bonding#Bonding_Driver_Options

However, all traffic goes over one interface only.
Maximum throughput is ~900 Mbit/s.

Googling the issue, I learned that VC does not support LACP bonding across 
different VC modules, so they are only "little-bit-stackable" switches.

Next try was bonding with balance-rr as given here:
http://www.linuxfoundation.org/collaborate/workgroups/networking/bonding#Maximum_Throughput_in_a_Multiple_Switch_Topology

To get the whole symmetry described there, I connected the external gateway 
with 6 Ethernet ports to the VC modules on a 1:1 basis. However, this breaks 
PXE booting, since the PXE environment does not appear to support bonding, so 
even the first DHCP exchange fails.

The current best setting is to have the blades on balance-rr and the gateway 
connected by 8 parallel Gbit links to one single VC device, using LACP / 
802.3ad on that bundle.
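
For completeness, the current configuration is roughly this (option values 
quoted from memory, interface names simplified, the ifupdown glue omitted):

# on each blade: round-robin over all six ports
modprobe bonding mode=balance-rr miimon=100
ip link set dev bond0 up
ifenslave bond0 eth0 eth1 eth2 eth3 eth4 eth5

# on the gateway: 802.3ad / LACP over the eight links into the single VC module
modprobe bonding mode=802.3ad miimon=100
ip link set dev bond0 up
ifenslave bond0 eth0 eth1 eth2 eth3 eth4 eth5 eth6 eth7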

However, performance is still far below expectations:
~2.5 Gbit/s between two blades, using NFS copies of 3 Gbit files located in a 
ramdisk
~0.9 Gbit/s between server and blade via NFS copy
~2.8 Gbit/s running dbench -D /home 50 in parallel on 16 clients

I partially understand the last two figures as limitations of the 802.3ad LACP 
protocol.
I can see unequal load distribution in the ifconfig stats.
I can watch periodic ups and downs during the 5 min dbench run, so I suspect 
some kind of TCP congestion issue.

I still do not understand the limitations of the direct blade-to-blade 
transfer using the round-robin policy. According to ifconfig, both incoming 
and outgoing traffic is equally distributed over all physical links.
I'm afraid this has something to do with TCP reordering / slow-down / the 
congestion window.
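
If it is indeed reordering, I suppose counters like these should show it 
during a transfer (just the checks I intend to run, nothing authoritative):

# TCP reordering / retransmission counters, before and after a test run
netstat -s | grep -iE 'reorder|retrans'
# how much reordering the stack tolerates before treating it as loss
sysctl net.ipv4.tcp_reordering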


Thank you for any pointers.

Wolfgang Rosner
--