Hello,

the good news in short:  IT WORKS

I get 5.58 GBit/sec over 6 x 1 GBit between my blade nodes, using layer 3
teql link aggregation:

root@blade-002:~# iperf -c 192.168.130.225
........
[ ID] Interval       Transfer     Bandwidth
[  3]  0.0-10.0 sec  6.49 GBytes  5.58 Gbits/sec

The /27 net approach worked fine and was straightforward. It is a simple
extension of the /31 approach described here
http://lartc.org/howto/lartc.loadshare.html
Just the default routes that come up when configuring the IP addresses
(a rough sketch of the resulting per-node commands follows further below).

I divided a /24 net into 8 chunks:
- one for the boot configuration (PXE, nfsroot...)
- 6 for the parallel link subnets, one per link
- one for the teql subnet

root@blade-001:~# ip addr
....
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9000 qdisc teql0 state UP qlen 1000
    link/ether 00:22:64:06:9b:7a brd ff:ff:ff:ff:ff:ff
    inet 192.168.130.1/27 brd 192.168.130.31 scope global eth0
    inet 192.168.130.33/27 scope global eth0:0
3: eth1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9000 qdisc teql0 state UP qlen 1000
    link/ether 00:22:64:06:db:4c brd ff:ff:ff:ff:ff:ff
    inet 192.168.130.65/27 scope global eth1
4: eth2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9000 qdisc teql0 state UP qlen 1000
    link/ether 00:21:5a:af:8e:40 brd ff:ff:ff:ff:ff:ff
    inet 192.168.130.97/27 scope global eth2
....
7: eth5: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9000 qdisc teql0 state UP qlen 1000
    link/ether 00:21:5a:af:8e:43 brd ff:ff:ff:ff:ff:ff
    inet 192.168.130.193/27 scope global eth5
8: teql0: <BROADCAST,NOARP,UP,LOWER_UP> mtu 9000 qdisc pfifo_fast state UNKNOWN qlen 100
    link/void
    inet 192.168.130.225/27 scope global teql0
       valid_lft forever preferred_lft forever
(boring lines deleted)

Jumbo frames (mtu = 9000) are essential; they increase throughput from
~3 GBit (aka 50 % of the theoretical maximum) to > 5.5 GBit (aka > 90 %).

So far so good: I can combine the performance of layer 2 aggregation
(bonding) with layer 3 control of what is going on, getting clamps on
nasty switch behaviour. At least, so I hoped.

==== QUIRKS ====

But when it comes to transfers between the blade nodes and the external
gateway, things get funny again.

This is how the network looks now: the gateway aka cruncher is connected
by one GBit cable each to the six VC switches in the blade enclosure. For
each VC bay (matching the physical /27 subnets) I configured a separate
VLAN, to convince VC to treat the uplinks as parallel, not as failover.

+-------------eth4---gateway(aka cruncher)
| +-------------eth5---gateway(aka cruncher)
| | +-------------eth6---gateway(aka cruncher)
| | | +-------------eth7---gateway(aka cruncher)
| | | | +-------------eth8---gateway(aka cruncher)
| | | | | +-------------eth9---gateway(aka cruncher)
+-+-+-+-+-+----blade-001
+-+-+-+-+-+----blade-002
+-+-+-+-+-+----blade-003

A straight implementation of the above scheme on the gateway yields no
more than ~2 GBit. So some aggregation happens, but far from the 6 GBit
maximum. ifconfig and wireshark show traffic coming in equally over all
6 lines, but with an awful lot of retransmits. Well, maybe wireshark just
gets confused by teql and fails to match packets since they go over
different interfaces, but that's another issue, not the primary one here.
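Before digging into the gateway problem: for reference, here is roughly
what the per-node teql setup boils down to on blade-001. This is a sketch
reconstructed from the ip addr output above, not a copy of my actual init
scripts; blade-002 presumably follows with .2, .34, .66, ... and the
gateway does the analogue on eth4..eth9 with its own addresses in each /27.

modprobe sch_teql                          # creates the teql0 device

ip addr add 192.168.130.1/27 dev eth0      # boot/PXE subnet, primary on eth0

for i in 0 1 2 3 4 5; do
    ip link set dev eth$i mtu 9000 up
    ip addr add 192.168.130.$((33 + 32 * i))/27 dev eth$i   # per-link /27
    tc qdisc add dev eth$i root teql0                       # enslave under teql0
done

ip link set dev teql0 mtu 9000 up
ip addr add 192.168.130.225/27 dev teql0   # the aggregate /27

(The lartc loadshare howto also mentions reverse path filtering on the
slave interfaces; I have left that out of the sketch.)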
After lots of googling, I pinned the symptom down to this issue:

# for i in `seq 2 9`; do ethtool -S eth$i | grep rx_missed_errors ; done
     rx_missed_errors: 0
     rx_missed_errors: 0
     rx_missed_errors: 0
     rx_missed_errors: 0
     rx_missed_errors: 29159
     rx_missed_errors: 28619
     rx_missed_errors: 9263
     rx_missed_errors: 23306

From http://osdir.com/ml/linux.drivers.e1000.devel/2007-11/msg00133.html
---<quote>--------------------
you are running out of bus bandwidth (which is why increasing descriptors
doesn't help). rx_missed_errors occur when you run out of fifo on the
adapter itself, indicating the bus can't be attained for long enough to
keep the data rate up.
---</quote>--------------------

eth2..eth5 and eth6..eth9 are one quad-port 82571EB Gigabit Ethernet
adapter each. Extracted from lspci I find:

0c:00.0 Ethernet controller: Intel Corporation 82571EB Gigabit Ethernet Controller (rev 06)
        Subsystem: Intel Corporation PRO/1000 PT Quad Port Server Adapter
07:00.1 Ethernet controller: Intel Corporation 82571EB Gigabit Ethernet Controller (Copper) (rev 06)
        Subsystem: Hewlett-Packard Company NC364T PCI Express Quad Port Gigabit Server Adapter

 +-0a.0-[05-08]----00.0-[06-08]--+-02.0-[07]--+-00.0
 |                               |            \-00.1
 |                               \-04.0-[08]--+-00.0
 |                                            \-00.1
 +-0b.0-[09]--+-00.0
 |            \-00.1
 +-0d.0-[0a-0d]----00.0-[0b-0d]--+-00.0-[0c]--+-00.0
 |                               |            \-00.1
 |                               \-01.0-[0d]--+-00.0
 |                                            \-00.1

So both adapters have the same chipset, the same driver, similar bus
connectivity, and announce identical PCIe link bandwidth:

    LnkSta: Speed 2.5GT/s, Width x4

Believing http://en.wikipedia.org/wiki/PCI_Express this comes out to
8 GBit/s, which should basically suffice, I think. And on the "good" NIC
it obviously does:

To check, and to gain some safety headroom, I moved 2 cables from the
"buggy" NIC to the "healthy" one - keeping the link config matching, of
course. And indeed, we go up from ~2 GBit to > 3 GBit. There are still
thousands of rx_missed_errors on the "bad" NIC, which now only has to
serve 2 GBit worth of connections, and still zero rx_missed_errors on the
"good" NIC, which now carries 4 active GBit links.

Further googling and tweaking memory limits in /proc/sys/net/ipv4/tcp_*mem
and /proc/sys/net/core/*mem* showed no difference.

What did help was to increase the "TCP window size" on the iperf server
side from "TCP window size: 85.3 KByte (default)" to a value between 512K
and 2M:

root@cruncher:/cluster/etc/network# iperf -s -w1M
------------------------------------------------------------
Server listening on TCP port 5001
TCP window size: 2.00 MByte (WARNING: requested 1.00 MByte)
------------------------------------------------------------
[  4] local 192.168.130.254 port 5001 connected with 192.168.130.226 port 33775
[ ID] Interval       Transfer     Bandwidth
[  4]  0.0-10.0 sec  5.06 GBytes  4.35 Gbits/sec

Now we are over 70 % of the theoretical maximum. However, I neither really
understand it, nor do I know how to transfer this window size setting to
other applications.
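As far as I understand it, iperf -w simply does a setsockopt(SO_RCVBUF /
SO_SNDBUF) on its sockets; the kernel doubles the requested value (hence
the "2.00 MByte (WARNING: requested 1.00 MByte)" above), and a socket with
SO_RCVBUF set no longer gets its receive buffer autotuned. The system-wide
counterparts are the autotuning limits below - the values are only a
sketch in the same ballpark as -w 1M, and as said above, tweaking them
made no difference for me, so take this as a pointer, not a fix:

sysctl -w net.core.rmem_max=2097152
sysctl -w net.core.wmem_max=2097152
sysctl -w net.ipv4.tcp_rmem="4096 87380 2097152"   # min default max (bytes)
sysctl -w net.ipv4.tcp_wmem="4096 65536 2097152"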
I think the TCP window size is just a workaround for underlying problems,
because
- there are still lots of rx_missed_errors for eth6 and eth7
- the blade-to-blade connection reaches 5.6 GBit without any tweaking,
  even with the small default TCP window size:

root@blade-001:~# iperf -s
------------------------------------------------------------
Server listening on TCP port 5001
TCP window size: 85.3 KByte (default)
------------------------------------------------------------
[  4] local 192.168.130.225 port 5001 connected with 192.168.130.226 port 49581
[ ID] Interval       Transfer     Bandwidth
[  4]  0.0-10.0 sec  6.49 GBytes  5.58 Gbits/sec

Possible causes on my list:
- firmware problem (NICs, mainboard)
- hardware problem (NICs, mainboard)
- some really weird hidden tweak parameter
- conceptual limitation of the hardware design
- driver problem
- kernel / scheduling issue / IRQ / race ... whatever?
- still the nasty VC blade switch?
- anything else?

The gateway mainboard is a SABERTOOTH 990FX R2.0
  ([AMD/ATI] RD890 PCI to PCI bridge (external gfx1 port A))
- consumer grade, but quite recent
- gateway CPU is an AMD FX-8320, 8 cores
- Linux cruncher 3.19.0 #1 SMP Tue Mar 3 19:05:04 CET 2015 x86_64 GNU/Linux

The blade nodes are HP BL460c G1 blades, chipset Intel 5000
- enterprise grade, but quite a few years old by now, I suppose
- CPU: 2 x Xeon E5430 quad core
- Linux blade-002.crunchnet 3.16.0-0.bpo.4-amd64 #1 SMP Debian 3.16.7-ckt4-3~bpo70+1 (2015-02-12) x86_64 GNU/Linux

Testing memory bandwidth with mbw (as a first measure of system bus
throughput), the gateway outperforms the blades by a factor of two:

root@blade-002:~# mbw -n1 1000
AVG   Method: MEMCPY   Elapsed: 0.61679   MiB: 1000.00000   Copy: 1621.300 MiB/s
AVG   Method: DUMB     Elapsed: 0.51892   MiB: 1000.00000   Copy: 1927.068 MiB/s
AVG   Method: MCBLOCK  Elapsed: 0.39211   MiB: 1000.00000   Copy: 2550.311 MiB/s

root@cruncher...# mbw -n1 1000
AVG   Method: MEMCPY   Elapsed: 0.27301   MiB: 1000.00000   Copy: 3662.923 MiB/s
AVG   Method: DUMB     Elapsed: 0.19693   MiB: 1000.00000   Copy: 5077.972 MiB/s
AVG   Method: MCBLOCK  Elapsed: 0.19287   MiB: 1000.00000   Copy: 5184.947 MiB/s

So, conceptually, I see no reason why, of two nearly identical quad-GBit
adapters, one should fail so badly on the faster system.

I again compared the lspci output line by line and found a tiny difference:

Hewlett-Packard Company NC364T ...  (the 'bad' one)
        Region 0: Memory at fc400000 (32-bit, non-prefetchable) [size=128K]
        Region 1: Memory at fc300000 (32-bit, non-prefetchable) [size=512K]
        Region 2: I/O ports at 8000 [size=32]

Intel Corporation PRO/1000 PT ...  (the 'good' one)
        Region 0: Memory at fc5a0000 (32-bit, non-prefetchable) [size=128K]
        Region 1: Memory at fc580000 (32-bit, non-prefetchable) [size=128K]
        Region 2: I/O ports at 5020 [size=32]

So the "Region 1" memory is 4x larger on the 'bad' NIC. Any clue whether
this may be related? Just an uneducated guess: if it were some kind of
pointer fifo into some buffer memory, the larger one might run out of
referred buffer while the smaller one does not?

How do I proceed from "guess" to "know" to "cure"?
Anybody any idea?

======================

Just to exclude the idiot's error before hitting the send button: I
swapped the cables on the faulty NIC (after the move above, only two were
left there), and the rate on the teql link went down from > 2 GBit to
~340 Kbits/sec. So yes, the cabling was right before, and yes, the scheme
provides some fault tolerance, albeit with a severe hit to performance.
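For completeness, these are the probes I intend to run next to narrow the
list above down - nothing conclusive yet, just the diagnostics I know of;
the bus addresses are the ones from the lspci excerpt above:

# how the quad-port NICs' interrupts are spread over the 8 cores
# (one hot core would hint at an IRQ / scheduling issue)
grep eth /proc/interrupts

# negotiated PCIe link speed/width of both adapters, to rule out a
# silently downgraded link on the 'bad' one
lspci -vv -s 0c:00.0 | grep -E 'LnkCap|LnkSta'
lspci -vv -s 07:00.0 | grep -E 'LnkCap|LnkSta'

# watch the fifo counters live while an iperf run is going on
watch -n1 'for i in `seq 6 9`; do ethtool -S eth$i | grep rx_missed; done'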
Wolfgang Rosner