Hello Jay,

thanks for your prompt answer.

> You may also be able to tweak some interface parameters and
> improve things; I'll point you at this discussion from a few years ago:
>
> http://lists.openwall.net/netdev/2011/08/25/88

OK. I tried to tweak rx-usecs as described there, but saw no reproducible
difference. My system's default was 18, and I tried both 6 and 45.

Regarding the TSO et al. issue, I think these offloads are already enabled
by default on recent systems:

root@blade-001:~# ethtool -k eth0 | grep offload
tcp-segmentation-offload: on
udp-fragmentation-offload: off [fixed]
generic-segmentation-offload: on
generic-receive-offload: on
large-receive-offload: off [fixed]
rx-vlan-offload: on
tx-vlan-offload: on
l2-fwd-offload: off [fixed]

However, following the hints in that article, I found the most obvious way
to improve throughput: jumbo frames. With mtu = 9000 on both sides I get
5200 MBit/s netperf throughput, which is 86 % of the theoretical maximum
(it was 4100 MBit/s with mtu = 1500 before).

NFS transfer is at 3.4 GBit/s (was 2.7 GBit/s with mtu = 1500). I saw
4.2 GBit/s once, but cannot reproduce it.

The NFS options for the cross-mounted /run/shm ramdisks are shown by mount as

192.168.130.2:/shm on /cluster/shm/node002 type nfs4 (rw,noatime,vers=4.0,rsize=1048576,wsize=1048576,namlen=255,soft,proto=tcp,port=0,timeo=600,retrans=1,sec=sys,clientaddr=192.168.130.3,local_lock=none,addr=192.168.130.2)

What I have configured in the automounter script:

$nfs_opts = "-fstype=nfs4,sec=sys,async,noatime,fg,soft,intr,retrans=1,retry=0" ;

So I haven't configured rsize/wsize myself. As RTFM says, client and server
agree on the highest values both support, and end up at 1 MByte here.
Anyway, I am getting off topic, as this is not an NFS mailing list.

> % netstat -s | grep -i reord
>     Detected reordering 20 times using time stamp
>
> or you can hunt for the raw values in /proc/net/netstat or use
> nstat to print them:

Hm. I see figures, but how do I put meaning on them?

Before:

root@blade-002:~# netstat -s | grep -i reord
    Detected reordering 1 times using FACK
root@blade-003:~# netstat -s | grep -i reord
    Detected reordering 1 times using FACK
    Detected reordering 1 times using SACK

Now doing some work, copying a 4 GB file over NFS between ram disks
(from blade-003 to blade-002):

root@blade-002:~# time cp /cluster/shm/node003/random.002 /run/shm/random.002
real    0m8.701s
user    0m0.000s
sys     0m4.816s

After:

root@blade-002:~# netstat -s | grep -i reord
    Detected reordering 1 times using FACK
root@blade-003:~# netstat -s | grep -i reord
    Detected reordering 2 times using FACK
    Detected reordering 234 times using SACK
    Detected reordering 7 times using time stamp

Wouldn't I have expected the reordering problems on the receiver's side?
But I see them on the sender - I double and triple checked this....
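In case it helps to reproduce the numbers: as far as I understand the man
page, nstat keeps a per-user history file and prints the increase of each
counter since its last invocation, so the per-transfer deltas of the
reordering and retransmission counters can be isolated roughly like this
(file names are just the ones from my test):

# prime the history, so the next call only shows deltas
nstat > /dev/null

# run the transfer under test
time cp /cluster/shm/node003/random.002 /run/shm/random.002

# show only what changed, including reordering/retransmission counters
nstat | grep -E 'Reorder|Retrans'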
Just in case you have an eye for peculiarities I do not see:

Sender side:

root@blade-003:~# nstat
#kernel
IpInReceives                    721022             0.0
IpInDelivers                    721022             0.0
IpOutRequests                   550631             0.0
TcpActiveOpens                  1                  0.0
TcpPassiveOpens                 1                  0.0
TcpInSegs                       720990             0.0
TcpOutSegs                      2177539            0.0
TcpRetransSegs                  4566               0.0
UdpInDatagrams                  32                 0.0
UdpOutDatagrams                 2                  0.0
TcpExtDelayedACKs               33                 0.0
TcpExtTCPPrequeued              1                  0.0
TcpExtTCPHPHits                 2066               0.0
TcpExtTCPPureAcks               623665             0.0
TcpExtTCPHPAcks                 38636              0.0
TcpExtTCPSackRecovery           423                0.0
TcpExtTCPFACKReorder            1                  0.0
TcpExtTCPSACKReorder            233                0.0
TcpExtTCPTSReorder              7                  0.0
TcpExtTCPFullUndo               19                 0.0
TcpExtTCPPartialUndo            18                 0.0
TcpExtTCPDSACKUndo              336                0.0
TcpExtTCPFastRetrans            1642               0.0
TcpExtTCPForwardRetrans         2924               0.0
TcpExtTCPDSACKRecv              3942               0.0
TcpExtTCPDSACKOfoRecv           6                  0.0
TcpExtTCPDSACKIgnoredOld        15                 0.0
TcpExtTCPDSACKIgnoredNoUndo     177                0.0
TcpExtTCPSackShifted            62709              0.0
TcpExtTCPSackMerged             261712             0.0
TcpExtTCPSackShiftFallback      404717             0.0
TcpExtTCPRetransFail            43                 0.0
TcpExtTCPRcvCoalesce            536                0.0
TcpExtTCPOFOQueue               3                  0.0
TcpExtTCPSpuriousRtxHostQueues  605                0.0
TcpExtTCPAutoCorking            58763              0.0
TcpExtTCPOrigDataSent           2176191            0.0
IpExtInBcastPkts                30                 0.0
IpExtInOctets                   53708303           0.0
IpExtOutOctets                  3181655278         0.0
IpExtInBcastOctets              2280               0.0
IpExtInNoECTPkts                721719             0.0

Receiver side:

root@blade-002:~# nstat
#kernel
IpInReceives                    750213             0.0
IpInAddrErrors                  2                  0.0
IpInDelivers                    750211             0.0
IpOutRequests                   751510             0.0
IcmpInErrors                    246                0.0
IcmpInCsumErrors                112                0.0
IcmpInTimeExcds                 224                0.0
IcmpInEchoReps                  3                  0.0
IcmpInTimestamps                19                 0.0
IcmpOutErrors                   246                0.0
IcmpOutTimeExcds                224                0.0
IcmpOutEchoReps                 19                 0.0
IcmpOutTimestamps               3                  0.0
IcmpMsgInType0                  19                 0.0
IcmpMsgInType3                  224                0.0
IcmpMsgInType8                  3                  0.0
IcmpMsgOutType0                 3                  0.0
IcmpMsgOutType3                 224                0.0
IcmpMsgOutType8                 19                 0.0
TcpActiveOpens                  118                0.0
TcpPassiveOpens                 10                 0.0
TcpAttemptFails                 112                0.0
TcpInSegs                       748966             0.0
TcpOutSegs                      751036             0.0
TcpRetransSegs                  129                0.0
TcpOutRsts                      2                  0.0
UdpInDatagrams                  871                0.0
UdpOutDatagrams                 289                0.0
Ip6OutRequests                  10                 0.0
Ip6OutMcastPkts                 16                 0.0
Ip6OutOctets                    688                0.0
Ip6OutMcastOctets               1144               0.0
Icmp6OutMsgs                    10                 0.0
Icmp6OutRouterSolicits          3                  0.0
Icmp6OutNeighborSolicits        1                  0.0
Icmp6OutMLDv2Reports            6                  0.0
Icmp6OutType133                 3                  0.0
Icmp6OutType135                 1                  0.0
Icmp6OutType143                 6                  0.0
TcpExtPruneCalled               3                  0.0
TcpExtTW                        3                  0.0
TcpExtDelayedACKs               372                0.0
TcpExtDelayedACKLocked          2                  0.0
TcpExtDelayedACKLost            3926               0.0
TcpExtTCPPrequeued              2                  0.0
TcpExtTCPHPHits                 42851              0.0
TcpExtTCPPureAcks               1056               0.0
TcpExtTCPHPAcks                 10889              0.0
TcpExtTCPSackRecovery           4                  0.0
TcpExtTCPFACKReorder            1                  0.0
TcpExtTCPDSACKUndo              2                  0.0
TcpExtTCPFastRetrans            11                 0.0
TcpExtTCPForwardRetrans         3                  0.0
TcpExtTCPTimeouts               113                0.0
TcpExtTCPLossProbes             2                  0.0
TcpExtTCPLossProbeRecovery      1                  0.0
TcpExtTCPRcvCollapsed           543                0.0
TcpExtTCPDSACKOldSent           3951               0.0
TcpExtTCPDSACKOfoSent           6                  0.0
TcpExtTCPDSACKRecv              13                 0.0
TcpExtTCPDSACKIgnoredNoUndo     1                  0.0
TcpExtTCPSackShifted            7                  0.0
TcpExtTCPSackMerged             23                 0.0
TcpExtTCPSackShiftFallback      72                 0.0
TcpExtTCPBacklogDrop            204                0.0
TcpExtTCPRcvCoalesce            44267              0.0
TcpExtTCPOFOQueue               487750             0.0
TcpExtTCPOFOMerge               6                  0.0
TcpExtTCPSpuriousRtxHostQueues  112                0.0
TcpExtTCPAutoCorking            1424               0.0
TcpExtTCPWantZeroWindowAdv      45                 0.0
TcpExtTCPSynRetrans             112                0.0
TcpExtTCPOrigDataSent           16466              0.0
IpExtInBcastPkts                710                0.0
IpExtInOctets                   3252256888         0.0
IpExtOutOctets                  57783538           0.0
IpExtInBcastOctets              75470              0.0
IpExtInNoECTPkts                2236348            0.0

Anyway, I could live with these figures, which I get between bonding
interfaces configured with balance-rr.
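For reference, the balance-rr setup between the blades boils down to
something like the following iproute2 sketch (however your distro's network
config actually expresses it; interface names and the address are just
placeholders from my setup):

# round-robin bond over the six blade NICs, with link monitoring
modprobe bonding
ip link add bond0 type bond mode balance-rr miimon 100

for i in 0 1 2 3 4 5; do
    ip link set eth$i down
    ip link set eth$i master bond0
done

ip link set bond0 up
ip addr add 192.168.130.3/24 dev bond0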
However, when I switch over to the gateway, which is connected by an 802.3ad
bonding policy link, performance sucks.

From rr to 802.3ad:

root@cruncher:/cluster/etc/scripts/available# time cp /cluster/shm/node003/random.002 /run/shm/
real    1m20.708s
user    0m0.000s
sys     0m5.812s

=> 37 MByte/s = 300 MBit/s

From 802.3ad to rr:

root@blade-002:~# time cp /cluster/shm/node000/random.002 /run/shm/random.002
real    0m26.747s
user    0m0.008s
sys     0m4.256s

=> 111 MByte/s = 888 MBit/s

> > I tried layer 2 bonding as described here (... searching for a)
> > all-linux, maybe layer 3 alternative,

So maybe I'd leave the rr in place for peer-to-peer connections between the
blades and just have a layer-3, teql-like thing to the gateway?

Hm, but can this work? balance-rr bonding syncs all MACs on the bond slaves,
so I'm afraid there is no longer a chance to mix it with assigning individual
IPs to the slave interfaces, right? But when all interfaces have the same
MAC, distribution is left to the switch, with all the opacity problems I
encountered. So either I go all-layer-2 or all-layer-3, right?

> That text in the bonding documentation is fairly old, and (...)
> It doesn't work well today, if for no other reason than
> interrupt coalescing and NAPI on the receiver will induce serious out of
> order delivery, and turning that off is not really an option.

Well, as my figures above tell me, it's not that bad, as long as it can be
configured undisturbed on both sides and matches the switch topology.

> > - How does the routing look like if I have 17 hosts connected by 6
> >   interfaces each?

As long as this question is not worked out, I have no chance to test teql on
my system, I'm afraid.

> > Current best setting is now having the blades on balancing-rr and the
> > gateway connected by 8 parallel Gbit-links to one single VC-device and
> > using LACP / 802.3ad on this.
>
> If you're testing your single stream throughput through this
> LACP aggregation, you'll be limited by the throughput of one member of
> that aggregation, as LACP will not stripe traffic.

I know. That's the reason why I would like to do round robin. What can I
expect from teql as compared to rr bonding and to LACP bonding? (I'll sketch
below what I have in mind.)

> Another issue is that, even if you round-robin from the host's
> bond, if traffic has to transit through a switch aggregation (channel
> group), it will rebalance the traffic on egress, and most likely funnel
> it all back through a single switch port.

That's obviously what happens in blade <-> gateway connections, due to the
asymmetric connection. In blade <-> blade peering it works fine, as I wrote.
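Coming back to the teql question: from what I read in the LARTC HOWTO, a
layer-3 equalizer over, say, two of the gateway links would be set up roughly
like this on both ends (device names and addresses are just placeholders; I
have not tried it yet):

modprobe sch_teql

# hang both physical links off the same TEQL master device
tc qdisc add dev eth2 root teql0
tc qdisc add dev eth3 root teql0

ip link set dev teql0 up
ip addr add 192.168.131.1/24 dev teql0      # the peer would get .2

# the HOWTO warns that return packets may come in on the "wrong" slave,
# so reverse-path filtering has to be relaxed on the slave interfaces
echo 0 > /proc/sys/net/ipv4/conf/eth2/rp_filter
echo 0 > /proc/sys/net/ipv4/conf/eth3/rp_filter

Whether this stripes a single TCP stream across the links any better than
802.3ad, without the reordering getting out of hand, is exactly what I would
like to find out.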
I'll try to draw an ASCII image of the topology:

+-+-+-+-+-+----blade-001
+-+-+-+-+-+----blade-002
+-+-+-+-+-+----blade-003
+-+-+-+-+-+----blade-004
+-+-+-+-+-+----blade-005
+-+-+-+-+-+----blade-006
+-+-+-+-+-+----blade-007
+-+-+-+-+-+----blade-008
+-+-+-+-+-+----blade-009
+-+-+-+-+-+----blade-010
+-+-+-+-+-+----blade-011
+-+-+-+-+-+----blade-012
+-+-+-+-+-+----blade-013
+-+-+-+-+-+----blade-014
+-+-+-+-+-+----blade-015
+-+-+-+-+-+----blade-016
+-------------eth2---gateway (aka cruncher)
+-------------eth3---gateway (aka cruncher)
+-------------eth4---gateway (aka cruncher)
+-------------eth5---gateway (aka cruncher)
+-------------eth6---gateway (aka cruncher)
+-------------eth7---gateway (aka cruncher)
+-------------eth8---gateway (aka cruncher)
+-------------eth9---gateway (aka cruncher)

Each "+" column is a VC switching module. The blades have eth0 ... eth5
connected in the hardwired matrix shown above. There are additional stacking
links between the VC switching modules, not shown here. But it looks like the
shortest-path algorithm keeps rr neatly ordered between blades.

However, when I distribute the gateway connections equally across all switch
modules, only one of them is "link-active"; the others are shown as
"link-failover". Only when I connect all of them to a single VC and configure
them with LACP are they used in parallel - but then they don't match the
round robin mode, just as you mention.

> one-switch-per-interface sort of arrangement that blade environments
> impose, and never really got bonding to work well for load balancing in
> those type of environments.
>
> One issue for production use was that if a switch port fails on
> one of the switches, the other peers sending traffic into that switch

Well, I think there are different goals in HIGH-PERFORMANCE clustering as
opposed to HIGH-AVAILABILITY clustering. Most "production use" refers to web
server or enterprise system stuff, which is basically HA, I'd say. And that's
what these boxes are optimised for - see the link-failover issue above.

Setting up a new HPC cluster with a bunch of dollar notes, I would presumably
just go for InfiniBand instead of Ethernet (or at least 10 GBit Ethernet),
but there is no budget for that. I simply try to get the best out of the
stuff I can pick up at the lower end of the food chain ;-)

Hm, so what? I'll try to read the HP documentation to see whether I can get
rid of the failover behaviour. If only I could rip off all the stacking links
and let the VC modules each behave as a "good old cheap and silly" switch....

Wolfgang Rosner

--
To unsubscribe from this list: send the line "unsubscribe lartc" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html