Hello,

the good news in short:  IT WORKS

I get 5.58 GBit/sec over 6 x 1 GBit between my blade nodes, using layer 3
teql link aggregation:

root@blade-002:~# iperf -c 192.168.130.225
........
[ ID] Interval       Transfer     Bandwidth
[  3]  0.0-10.0 sec  6.49 GBytes  5.58 Gbits/sec

The /27 net approach worked fine and was straightforward. It is a simple
extension of the /31 approach described here
http://lartc.org/howto/lartc.loadshare.html
Just the default routes that come up when configuring the IP addresses
(a rough sketch of the resulting per-node commands follows further below).

I divided a /24 net into 8 chunks:
- one for the boot configuration (PXE, nfsroot...)
- 6 for the parallel link subnets, one per link
- one for the teql subnet

root@blade-001:~# ip addr
....
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9000 qdisc teql0 state UP qlen 1000
    link/ether 00:22:64:06:9b:7a brd ff:ff:ff:ff:ff:ff
    inet 192.168.130.1/27 brd 192.168.130.31 scope global eth0
    inet 192.168.130.33/27 scope global eth0:0
3: eth1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9000 qdisc teql0 state UP qlen 1000
    link/ether 00:22:64:06:db:4c brd ff:ff:ff:ff:ff:ff
    inet 192.168.130.65/27 scope global eth1
4: eth2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9000 qdisc teql0 state UP qlen 1000
    link/ether 00:21:5a:af:8e:40 brd ff:ff:ff:ff:ff:ff
    inet 192.168.130.97/27 scope global eth2
....
7: eth5: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9000 qdisc teql0 state UP qlen 1000
    link/ether 00:21:5a:af:8e:43 brd ff:ff:ff:ff:ff:ff
    inet 192.168.130.193/27 scope global eth5
8: teql0: <BROADCAST,NOARP,UP,LOWER_UP> mtu 9000 qdisc pfifo_fast state UNKNOWN qlen 100
    link/void
    inet 192.168.130.225/27 scope global teql0
       valid_lft forever preferred_lft forever
(boring lines deleted)

Jumbo frames (mtu = 9000) are essential; they increase throughput from
~3 GBit (aka 50 % of the theoretical maximum) to > 5.5 GBit (aka > 90 %).

So far so good: I can combine the performance of layer 2 aggregation
(bonding) with layer 3 control of what is going on, getting clamps on
nasty switch behaviour. At least, so I hoped.

==== QUIRKS ====

But when it comes to transfers between the blade nodes and the external
gateway, things get funny again.

This is how the network looks now: the gateway aka cruncher is connected
by one GBit cable each to the six VC switches in the blade enclosure. For
each VC bay (matching the physical /27 subnets) I configured a separate
VLAN, to convince VC to treat the uplinks as parallel, not as failover.

+-------------eth4---gateway(aka cruncher)
| +-------------eth5---gateway(aka cruncher)
| | +-------------eth6---gateway(aka cruncher)
| | | +-------------eth7---gateway(aka cruncher)
| | | | +-------------eth8---gateway(aka cruncher)
| | | | | +-------------eth9---gateway(aka cruncher)
+-+-+-+-+-+----blade-001
+-+-+-+-+-+----blade-002
+-+-+-+-+-+----blade-003

A straight implementation of the above scheme on the gateway yields no
more than ~2 GBit. So some aggregation happens, but far from the 6 GBit
maximum. ifconfig and wireshark show traffic coming in equally over all
6 lines, but with an awful lot of retransmits. Well, maybe wireshark just
gets confused by teql and fails to match packets since they go over
different interfaces, but that's another issue, not the primary one here.
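Before digging into the gateway problem: for reference, here is roughly
what the per-node teql setup boils down to on blade-001. This is a sketch
reconstructed from the ip addr output above, not a copy of my actual init
scripts; blade-002 presumably follows with .2, .34, .66, ... and the
gateway does the analogue on eth4..eth9 with its own addresses in each /27.

modprobe sch_teql                          # creates the teql0 device

ip addr add 192.168.130.1/27 dev eth0      # boot/PXE subnet, primary on eth0

for i in 0 1 2 3 4 5; do
    ip link set dev eth$i mtu 9000 up
    ip addr add 192.168.130.$((33 + 32 * i))/27 dev eth$i   # per-link /27
    tc qdisc add dev eth$i root teql0                       # enslave under teql0
done

ip link set dev teql0 mtu 9000 up
ip addr add 192.168.130.225/27 dev teql0   # the aggregate /27

(The lartc loadshare howto also mentions reverse path filtering on the
slave interfaces; I have left that out of the sketch.)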
After lots of googling, I pinned the symptom down to this issue:

# for i in `seq 2 9`; do ethtool -S eth$i | grep rx_missed_errors ; done
     rx_missed_errors: 0
     rx_missed_errors: 0
     rx_missed_errors: 0
     rx_missed_errors: 0
     rx_missed_errors: 29159
     rx_missed_errors: 28619
     rx_missed_errors: 9263
     rx_missed_errors: 23306

From http://osdir.com/ml/linux.drivers.e1000.devel/2007-11/msg00133.html
---<quote>--------------------
you are running out of bus bandwidth (which is why increasing descriptors
doesn't help). rx_missed_errors occur when you run out of fifo on the
adapter itself, indicating the bus can't be attained for long enough to
keep the data rate up.
---</quote>--------------------

eth2..eth5 and eth6..eth9 are one quad-port 82571EB Gigabit Ethernet
adapter each. Extracted from lspci I find:

0c:00.0 Ethernet controller: Intel Corporation 82571EB Gigabit Ethernet Controller (rev 06)
        Subsystem: Intel Corporation PRO/1000 PT Quad Port Server Adapter
07:00.1 Ethernet controller: Intel Corporation 82571EB Gigabit Ethernet Controller (Copper) (rev 06)
        Subsystem: Hewlett-Packard Company NC364T PCI Express Quad Port Gigabit Server Adapter

 +-0a.0-[05-08]----00.0-[06-08]--+-02.0-[07]--+-00.0
 |                               |            \-00.1
 |                               \-04.0-[08]--+-00.0
 |                                            \-00.1
 +-0b.0-[09]--+-00.0
 |            \-00.1
 +-0d.0-[0a-0d]----00.0-[0b-0d]--+-00.0-[0c]--+-00.0
 |                               |            \-00.1
 |                               \-01.0-[0d]--+-00.0
 |                                            \-00.1

So both adapters have the same chipset, the same driver, similar bus
connectivity, and announce identical PCIe link bandwidth:

    LnkSta: Speed 2.5GT/s, Width x4

Believing http://en.wikipedia.org/wiki/PCI_Express this comes out to
8 GBit/s, which should basically suffice, I think. And on the "good" NIC
it obviously does:

To check, and to gain some safety headroom, I moved 2 cables from the
"buggy" NIC to the "healthy" one - keeping the link config matching, of
course. And indeed, we go up from ~2 GBit to > 3 GBit. There are still
thousands of rx_missed_errors on the "bad" NIC, which now only has to
serve 2 GBit worth of connections, and still zero rx_missed_errors on the
"good" NIC, which now carries 4 active GBit links.

Further googling and tweaking memory limits in /proc/sys/net/ipv4/tcp_*mem
and /proc/sys/net/core/*mem* showed no difference.

What did help was to increase the "TCP window size" on the iperf server
side from "TCP window size: 85.3 KByte (default)" to a value between 512K
and 2M:

root@cruncher:/cluster/etc/network# iperf -s -w1M
------------------------------------------------------------
Server listening on TCP port 5001
TCP window size: 2.00 MByte (WARNING: requested 1.00 MByte)
------------------------------------------------------------
[  4] local 192.168.130.254 port 5001 connected with 192.168.130.226 port 33775
[ ID] Interval       Transfer     Bandwidth
[  4]  0.0-10.0 sec  5.06 GBytes  4.35 Gbits/sec

Now we are over 70 % of the theoretical maximum. However, I neither really
understand it, nor do I know how to transfer this window size setting to
other applications.
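As far as I understand it, iperf -w simply does a setsockopt(SO_RCVBUF /
SO_SNDBUF) on its sockets; the kernel doubles the requested value (hence
the "2.00 MByte (WARNING: requested 1.00 MByte)" above), and a socket with
SO_RCVBUF set no longer gets its receive buffer autotuned. The system-wide
counterparts are the autotuning limits below - the values are only a
sketch in the same ballpark as -w 1M, and as said above, tweaking them
made no difference for me, so take this as a pointer, not a fix:

sysctl -w net.core.rmem_max=2097152
sysctl -w net.core.wmem_max=2097152
sysctl -w net.ipv4.tcp_rmem="4096 87380 2097152"   # min default max (bytes)
sysctl -w net.ipv4.tcp_wmem="4096 65536 2097152"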
I think the TCP window size is just a workaround for underlying problems,
because
- there are still lots of rx_missed_errors for eth6 and eth7
- the blade-to-blade connection reaches 5.6 GBit without any tweaking,
  even with the small default TCP window size:

root@blade-001:~# iperf -s
------------------------------------------------------------
Server listening on TCP port 5001
TCP window size: 85.3 KByte (default)
------------------------------------------------------------
[  4] local 192.168.130.225 port 5001 connected with 192.168.130.226 port 49581
[ ID] Interval       Transfer     Bandwidth
[  4]  0.0-10.0 sec  6.49 GBytes  5.58 Gbits/sec

Possible causes on my list:
- firmware problem (NICs, mainboard)
- hardware problem (NICs, mainboard)
- some really weird hidden tweak parameter
- conceptual limitation of the hardware design
- driver problem
- kernel / scheduling issue / IRQ / race ... whatever?
- still the nasty VC blade switch?
- anything else?

The gateway mainboard is a SABERTOOTH 990FX R2.0
  ([AMD/ATI] RD890 PCI to PCI bridge (external gfx1 port A))
- consumer grade, but quite recent
- gateway CPU is an AMD FX-8320, 8 cores
- Linux cruncher 3.19.0 #1 SMP Tue Mar 3 19:05:04 CET 2015 x86_64 GNU/Linux

The blade nodes are HP BL460c G1 blades, chipset Intel 5000
- enterprise grade, but quite a few years old by now, I suppose
- CPU: 2 x Xeon E5430 quad core
- Linux blade-002.crunchnet 3.16.0-0.bpo.4-amd64 #1 SMP Debian 3.16.7-ckt4-3~bpo70+1 (2015-02-12) x86_64 GNU/Linux

Testing memory bandwidth with mbw (as a first measure of system bus
throughput), the gateway outperforms the blades by a factor of two:

root@blade-002:~# mbw -n1 1000
AVG   Method: MEMCPY   Elapsed: 0.61679   MiB: 1000.00000   Copy: 1621.300 MiB/s
AVG   Method: DUMB     Elapsed: 0.51892   MiB: 1000.00000   Copy: 1927.068 MiB/s
AVG   Method: MCBLOCK  Elapsed: 0.39211   MiB: 1000.00000   Copy: 2550.311 MiB/s

root@cruncher...# mbw -n1 1000
AVG   Method: MEMCPY   Elapsed: 0.27301   MiB: 1000.00000   Copy: 3662.923 MiB/s
AVG   Method: DUMB     Elapsed: 0.19693   MiB: 1000.00000   Copy: 5077.972 MiB/s
AVG   Method: MCBLOCK  Elapsed: 0.19287   MiB: 1000.00000   Copy: 5184.947 MiB/s

So, conceptually, I see no reason why, of two nearly identical quad-GBit
adapters, one should fail so badly on the faster system.

I again compared the lspci output line by line and found a tiny difference:

Hewlett-Packard Company NC364T ...  (the 'bad' one)
        Region 0: Memory at fc400000 (32-bit, non-prefetchable) [size=128K]
        Region 1: Memory at fc300000 (32-bit, non-prefetchable) [size=512K]
        Region 2: I/O ports at 8000 [size=32]

Intel Corporation PRO/1000 PT ...  (the 'good' one)
        Region 0: Memory at fc5a0000 (32-bit, non-prefetchable) [size=128K]
        Region 1: Memory at fc580000 (32-bit, non-prefetchable) [size=128K]
        Region 2: I/O ports at 5020 [size=32]

So the "Region 1" memory is 4x larger on the 'bad' NIC. Any clue whether
this may be related? Just an uneducated guess: if it were some kind of
pointer fifo into some buffer memory, the larger one might run out of
referred buffer while the smaller one does not?

How do I proceed from "guess" to "know" to "cure"?
Anybody any idea?

======================

Just to exclude the idiot's error before hitting the send button: I
swapped the cables on the faulty NIC (after the move above, only two were
left there), and the rate on the teql link went down from > 2 GBit to
~340 Kbits/sec. So yes, the cabling was right before, and yes, the scheme
provides some fault tolerance, albeit with a severe hit to performance.
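For completeness, these are the probes I intend to run next to narrow the
list above down - nothing conclusive yet, just the diagnostics I know of;
the bus addresses are the ones from the lspci excerpt above:

# how the quad-port NICs' interrupts are spread over the 8 cores
# (one hot core would hint at an IRQ / scheduling issue)
grep eth /proc/interrupts

# negotiated PCIe link speed/width of both adapters, to rule out a
# silently downgraded link on the 'bad' one
lspci -vv -s 0c:00.0 | grep -E 'LnkCap|LnkSta'
lspci -vv -s 07:00.0 | grep -E 'LnkCap|LnkSta'

# watch the fifo counters live while an iperf run is going on
watch -n1 'for i in `seq 6 9`; do ethtool -S eth$i | grep rx_missed; done'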
Wolfgang Rosner