Wolfgang Rosner <wrosner@xxxxxxxxx> wrote:

>Hello,
>
>Can I use TEQL to aggregate multiple Gbit ethernets in a multiple switch
>topology across multiple hosts?
>In my example, 17 hosts, each having 6 GBit ethernet cards?
>
>Did anybody try and maybe even document such an approach?
>
>I tried layer 2 bonding as described here
>http://www.linuxfoundation.org/collaborate/workgroups/networking/bonding#Maximum_Throughput_in_a_Multiple_Switch_Topology
>but have to struggle with disappointing performance gains, a misbehaving
>switch layer and problems during PXE/DHCP boot.

That text in the bonding documentation is fairly old, and describes a
configuration that is not common today.  It worked at the time because the
three switches did not communicate, and the hardware of the era delivered
one packet per receive interrupt (think 10 Mb/sec).  The round-robin
delivery of packets across interfaces generally stayed in sync, as there
was no packet coalescing on the receive side (no NAPI in the kernel,
either).  The switches could be cheap unmanaged switches, as there were no
channel groups on any particular switch, and no sharing of MAC tables
between them.

It doesn't work well today, if for no other reason than interrupt
coalescing and NAPI on the receiver will induce serious out of order
delivery, and turning that off is not really an option.

>Googling for a more controllable, all-linux, maybe layer 3 alternative, I
>encountered LARTC.
>I think multirouting as in chapter 4.2.2 does not solve my problem, as I
>want to share bandwidth for single large transfers, too.
>
>I'd like to try the TEQL approach of chapter 10, but there are some open
>questions:
>
>- How does the routing look if I have 17 hosts connected by 6 interfaces
>each?
>
>I think I cannot use the /31 net approach on a 1-to-1 basis, since I have
>17 machines on each subnet.
>Can I use /27 nets instead, allowing 30 hosts per subnet?
>
>Or do I need a /31 subnet for each pair of machines, on each switch
>device, which would be a total of (17 x 16 / 2) * 6 = 816 /31 subnets?
>
>Is this idea correct:
>- one IP address for teql0 and 6 x 1 IP for eth0 ... eth5 on each host,
>  which equals 7 x 17 = 119 IP addresses in total
>- a route for each target on every physical interface on every host,
>  pointing to the counterpart on the same subnet, like
>
>route add -host <teql-IP-on-target> gw <matching-dev-IP-on-target>
>
>This still adds up to 16 peers x 6 interfaces = 96 routes on each host.
>How does this affect performance?
>Of course I can script this, but is there a more "elegant" way?
>Like calculated / OR-ed filter addresses?
>
>- Can I continue to use the physical links directly, particularly for
>PXE booting?
>
>- Can I keep the switch configuration as one large network and let ARP /
>layer 3 sort out the details, or is it necessary/advantageous to configure
>all layer 3 subnets as separate layer 2 VLANs as well?
>Or do I even need 816 VLANs for 816 /31 subnets on a peer-to-peer basis?
>
>- The clients run diskless on nfsroot, which is established by the dracut
>boot process.
>So either I have to establish the whole teql setup within dracut during
>boot, or I have to reconfigure the network after boot, without dropping
>the running nfsroot. Is this possible?
>
>- I only find reports and advice for 2.x kernels in the list archives.
>Are there any advances on the TCP tuning issues in recent kernels?
>
>- Can I expect a performance gain at all, or will the additional CPU
>overhead outweigh the gain in bandwidth?
>
>- What are the recommended tools for testing and tuning?
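As a point of reference, the per-host teql plumbing from that chapter
scales out roughly as sketched below.  Treat it strictly as a sketch:
eth0 through eth5 and the 10.0.100.x addresses are placeholders, I
haven't run teql against Virtual Connect hardware, and the per-peer
routing and ARP behaviour across six separate VC modules (the part you
would be scripting for 16 peers) is exactly what you'd have to validate.

# attach the teql0 master qdisc to every physical link
modprobe sch_teql
for i in 0 1 2 3 4 5; do
    tc qdisc add dev eth$i root teql0
done
ip link set dev teql0 up

# one "equalized" address per host; with all 17 teql addresses in one
# subnet, the connected route on teql0 covers every peer
ip addr add 10.0.100.1/24 dev teql0

# teql delivers packets whose source address doesn't match the interface
# they arrive on, so relax return path filtering (see the chapter's
# caveats); newer kernels also consult conf/all/rp_filter
for i in 0 1 2 3 4 5; do
    echo 0 > /proc/sys/net/ipv4/conf/eth$i/rp_filter
done

Whether next-hop and ARP resolution for the peers' teql addresses behaves
sanely across the VC fabric, or whether you end up needing the explicit
per-peer, per-link host routes (and possibly VLANs) you describe, I can't
say; I'd prototype it between two blades before scripting all 17.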
>
>=================================================
>
>What I have done so far:
>
>I'm just going to build a "poor man's beowulf cluster" from a bunch of
>used server parts, sourced on ebay.
>
>So I end up with an HP blade center with 16 blade servers in it, each
>equipped with 6 x 1 GBit ethernet ports.
>They are linked by HP Virtual Connect ("VC") switch units, in the way
>that there are 6 VC modules, each with one port to every one of the
>blade servers.
>This mapping is hardwired by the blade center design.
>The VC is administered and advertised like one large manageable switch,
>but with caveats, see below.
>
>The whole thing is connected to the outside world via a consumer grade PC
>acting as a gateway and file server, with 2 x 4 = 8 Gbit ethernet ports
>on the cluster side.
>
>All boxes run debian wheezy, with 3.19.0 vanilla on the gateway and
>debian 3.16.7-ckt4-3~bpo70+1 on the blades.
>Blades are booted over DHCP/PXE/TFTP/nfsroot.
>
>Of course I would like to utilize the full available network bandwidth
>for interprocess communication.
>
>My first try was linux bonding with the 802.3ad bonding policy, see
>http://www.linuxfoundation.org/collaborate/workgroups/networking/bonding#Bonding_Driver_Options
>
>However, all traffic goes over one interface only.
>Maximum throughput is ~ 900 MBit/s.
>
>Googling the issue, I learned that VC does not support LACP bonding
>across different VC modules, so they are only "little-bit-stackable"
>switches.
>
>Next try was bonding with balance-rr as given here:
>http://www.linuxfoundation.org/collaborate/workgroups/networking/bonding#Maximum_Throughput_in_a_Multiple_Switch_Topology
>
>To get the whole symmetry described there, I connected the external
>gateway with 6 ethernet ports, one to each of the VC modules, on a 1:1
>basis. However, this breaks PXE booting, since the PXE machine does not
>appear to support bonding, so even the first DHCP breaks.
>
>Current best setting is now having the blades on balance-rr and the
>gateway connected by 8 parallel Gbit links to one single VC device, using
>LACP / 802.3ad on this.

If you're testing your single stream throughput through this LACP
aggregation, you'll be limited by the throughput of one member of that
aggregation, as LACP will not stripe traffic.

>However, performance is still far below expectation:
>~ 2.5 GBit between two blades, using nfs copy of 3 GBit files located in
>ramdisk
>~ 0.9 GBit between server and blade via nfs copy
>~ 2.8 GBit running dbench -D /home 50 in parallel on 16 clients
>
>I partially understand the last 2 figures as limitations of the 802.3ad
>LACP protocol.

Most link aggregation systems will keep packets for a given conversation
(connection) on just one aggregation member, specifically to prevent
reordering of packets.  On linux, the bonding balance-rr mode is the
exception; the other modes use some type of hash or assignment to
determine the interface to transmit on, and won't stripe across multiple
interfaces.

Another issue is that, even if you round-robin from the host's bond, if
traffic has to transit through a switch aggregation (channel group), it
will rebalance the traffic on egress, and most likely funnel it all back
through a single switch port.

>I can see unequal load distribution in ifconfig stats.
>I can watch periodic ups and downs during the 5 min dbench run, so I
>suspect some kind of a TCP congestion issue.
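That unequal distribution is what the hash-based modes give you by
design.  For the many-client numbers on the LACP leg it may be worth
checking what the bond hashes on; with layer3+4 hashing, distinct TCP
connections can land on different members, though any single connection
still rides just one.  A rough sketch (bond0 and eth2 stand in for your
actual device names, and the VC side does its own, independent hashing
for the return direction):

# see the current transmit hash policy of the bond
grep "Transmit Hash Policy" /proc/net/bonding/bond0

# hash on IP addresses and ports so separate flows can spread out
echo layer3+4 > /sys/class/net/bond0/bonding/xmit_hash_policy

# then watch the per-slave counters while dbench runs
ip -s link show eth2

None of that helps a single nfs copy; it only spreads multiple
simultaneous connections.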
>
>I still do not understand the limitations of the direct blade-to-blade
>transfer using the round-robin policy. According to ifconfig, both
>incoming and outgoing traffic is equally distributed over all physical
>links.
>I'm afraid this has something to do with TCP reordering / slow-down /
>congestion window.

Depending on your kernel, etc, you may be able to inspect some reordering
detection counters; netstat -s may report them, e.g.,

% netstat -s | grep -i reord
    Detected reordering 20 times using time stamp

or you can hunt for the raw values in /proc/net/netstat or use nstat to
print them:

TcpExt: TCPFACKReorder 0
TcpExt: TCPSACKReorder 0
TcpExt: TCPRenoReorder 0
TcpExt: TCPTSReorder 20

You may also be able to tweak some interface parameters and improve
things; I'll point you at this discussion from a few years ago:

http://lists.openwall.net/netdev/2011/08/25/88

I haven't tried what's described in that email in the
one-switch-per-interface sort of arrangement that blade environments
impose, and never really got bonding to work well for load balancing in
those types of environments.

One issue for production use was that if a switch port fails on one of
the switches, the other peers sending traffic into that switch will lose
any packets sent to the failed port, because their local link is up even
though a particular peer isn't reachable.  That brings up various cascade
failover sorts of problems, or just interconnecting all of the switches,
which then gets confused by the bond's traffic, wherein the source MAC is
the same for all interfaces.

-J

---
-Jay Vosburgh, jay.vosburgh@xxxxxxxxxxxxx
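P.S.  From memory, and without having re-read that thread, the knobs that
usually come up in these balance-rr reordering discussions look roughly
like the following; driver support for the ethtool pieces varies, so
measure before and after rather than trusting the theory:

# per slave interface; eth0 stands in for each member
ethtool -C eth0 rx-usecs 0 rx-frames 1   # interrupt (nearly) per packet
                                         # instead of per batch, trading
                                         # CPU for delivery order
ethtool -K eth0 gro off                  # don't merge received segments
                                         # into larger super-packets
# let TCP tolerate far more reordering before treating it as loss
sysctl -w net.ipv4.tcp_reordering=127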