Hello Jay,

thanks for your prompt answer.

> You may also be able to tweak some interface parameters and
> improve things; I'll point you at this discussion from a few years ago:
>
> http://lists.openwall.net/netdev/2011/08/25/88

OK. I tried to tweak rx-usecs as described there, but saw no reproducible
difference. My system's default was 18, and I tried both 6 and 45.

Regarding the TSO et al. issue, I think these offloads are already enabled
by default on recent systems:

root@blade-001:~# ethtool -k eth0 | grep offload
tcp-segmentation-offload: on
udp-fragmentation-offload: off [fixed]
generic-segmentation-offload: on
generic-receive-offload: on
large-receive-offload: off [fixed]
rx-vlan-offload: on
tx-vlan-offload: on
l2-fwd-offload: off [fixed]

However, following the hints in that article, I found the most obvious way
to improve throughput: jumbo frames. With mtu = 9000 on both sides I get
5200 MBit/s netperf throughput, which is 86 % of the theoretical maximum
(it was 4100 MBit/s with mtu = 1500 before).

NFS transfer is at 3.4 GBit/s (was 2.7 GBit/s with mtu = 1500). I saw
4.2 GBit/s once, but cannot reproduce it.

The NFS options for the cross-mounted /run/shm ramdisks are shown by mount as

192.168.130.2:/shm on /cluster/shm/node002 type nfs4 (rw,noatime,vers=4.0,rsize=1048576,wsize=1048576,namlen=255,soft,proto=tcp,port=0,timeo=600,retrans=1,sec=sys,clientaddr=192.168.130.3,local_lock=none,addr=192.168.130.2)

What I have configured in the automounter script:

$nfs_opts = "-fstype=nfs4,sec=sys,async,noatime,fg,soft,intr,retrans=1,retry=0" ;

So I haven't configured rsize/wsize myself. As RTFM says, client and server
agree on the highest values both support, and end up at 1 MByte here.
Anyway, I am getting off topic, as this is not an NFS mailing list.

> % netstat -s | grep -i reord
>     Detected reordering 20 times using time stamp
>
> or you can hunt for the raw values in /proc/net/netstat or use
> nstat to print them:

Hm. I see figures, but how do I put meaning on them?

Before:

root@blade-002:~# netstat -s | grep -i reord
    Detected reordering 1 times using FACK
root@blade-003:~# netstat -s | grep -i reord
    Detected reordering 1 times using FACK
    Detected reordering 1 times using SACK

Now doing some work, copying a 4 GB file over NFS between ram disks
(from blade-003 to blade-002):

root@blade-002:~# time cp /cluster/shm/node003/random.002 /run/shm/random.002
real    0m8.701s
user    0m0.000s
sys     0m4.816s

After:

root@blade-002:~# netstat -s | grep -i reord
    Detected reordering 1 times using FACK
root@blade-003:~# netstat -s | grep -i reord
    Detected reordering 2 times using FACK
    Detected reordering 234 times using SACK
    Detected reordering 7 times using time stamp

Wouldn't I have expected the reordering problems on the receiver's side?
But I see them on the sender - I double and triple checked this....
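In case it helps to reproduce the numbers: as far as I understand the man
page, nstat keeps a per-user history file and prints the increase of each
counter since its last invocation, so the per-transfer deltas of the
reordering and retransmission counters can be isolated roughly like this
(file names are just the ones from my test):

# prime the history, so the next call only shows deltas
nstat > /dev/null

# run the transfer under test
time cp /cluster/shm/node003/random.002 /run/shm/random.002

# show only what changed, including reordering/retransmission counters
nstat | grep -E 'Reorder|Retrans'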
Just in case you have an eye for peculiarities I do not see:

Sender side:

root@blade-003:~# nstat
#kernel
IpInReceives                    721022             0.0
IpInDelivers                    721022             0.0
IpOutRequests                   550631             0.0
TcpActiveOpens                  1                  0.0
TcpPassiveOpens                 1                  0.0
TcpInSegs                       720990             0.0
TcpOutSegs                      2177539            0.0
TcpRetransSegs                  4566               0.0
UdpInDatagrams                  32                 0.0
UdpOutDatagrams                 2                  0.0
TcpExtDelayedACKs               33                 0.0
TcpExtTCPPrequeued              1                  0.0
TcpExtTCPHPHits                 2066               0.0
TcpExtTCPPureAcks               623665             0.0
TcpExtTCPHPAcks                 38636              0.0
TcpExtTCPSackRecovery           423                0.0
TcpExtTCPFACKReorder            1                  0.0
TcpExtTCPSACKReorder            233                0.0
TcpExtTCPTSReorder              7                  0.0
TcpExtTCPFullUndo               19                 0.0
TcpExtTCPPartialUndo            18                 0.0
TcpExtTCPDSACKUndo              336                0.0
TcpExtTCPFastRetrans            1642               0.0
TcpExtTCPForwardRetrans         2924               0.0
TcpExtTCPDSACKRecv              3942               0.0
TcpExtTCPDSACKOfoRecv           6                  0.0
TcpExtTCPDSACKIgnoredOld        15                 0.0
TcpExtTCPDSACKIgnoredNoUndo     177                0.0
TcpExtTCPSackShifted            62709              0.0
TcpExtTCPSackMerged             261712             0.0
TcpExtTCPSackShiftFallback      404717             0.0
TcpExtTCPRetransFail            43                 0.0
TcpExtTCPRcvCoalesce            536                0.0
TcpExtTCPOFOQueue               3                  0.0
TcpExtTCPSpuriousRtxHostQueues  605                0.0
TcpExtTCPAutoCorking            58763              0.0
TcpExtTCPOrigDataSent           2176191            0.0
IpExtInBcastPkts                30                 0.0
IpExtInOctets                   53708303           0.0
IpExtOutOctets                  3181655278         0.0
IpExtInBcastOctets              2280               0.0
IpExtInNoECTPkts                721719             0.0

Receiver side:

root@blade-002:~# nstat
#kernel
IpInReceives                    750213             0.0
IpInAddrErrors                  2                  0.0
IpInDelivers                    750211             0.0
IpOutRequests                   751510             0.0
IcmpInErrors                    246                0.0
IcmpInCsumErrors                112                0.0
IcmpInTimeExcds                 224                0.0
IcmpInEchoReps                  3                  0.0
IcmpInTimestamps                19                 0.0
IcmpOutErrors                   246                0.0
IcmpOutTimeExcds                224                0.0
IcmpOutEchoReps                 19                 0.0
IcmpOutTimestamps               3                  0.0
IcmpMsgInType0                  19                 0.0
IcmpMsgInType3                  224                0.0
IcmpMsgInType8                  3                  0.0
IcmpMsgOutType0                 3                  0.0
IcmpMsgOutType3                 224                0.0
IcmpMsgOutType8                 19                 0.0
TcpActiveOpens                  118                0.0
TcpPassiveOpens                 10                 0.0
TcpAttemptFails                 112                0.0
TcpInSegs                       748966             0.0
TcpOutSegs                      751036             0.0
TcpRetransSegs                  129                0.0
TcpOutRsts                      2                  0.0
UdpInDatagrams                  871                0.0
UdpOutDatagrams                 289                0.0
Ip6OutRequests                  10                 0.0
Ip6OutMcastPkts                 16                 0.0
Ip6OutOctets                    688                0.0
Ip6OutMcastOctets               1144               0.0
Icmp6OutMsgs                    10                 0.0
Icmp6OutRouterSolicits          3                  0.0
Icmp6OutNeighborSolicits        1                  0.0
Icmp6OutMLDv2Reports            6                  0.0
Icmp6OutType133                 3                  0.0
Icmp6OutType135                 1                  0.0
Icmp6OutType143                 6                  0.0
TcpExtPruneCalled               3                  0.0
TcpExtTW                        3                  0.0
TcpExtDelayedACKs               372                0.0
TcpExtDelayedACKLocked          2                  0.0
TcpExtDelayedACKLost            3926               0.0
TcpExtTCPPrequeued              2                  0.0
TcpExtTCPHPHits                 42851              0.0
TcpExtTCPPureAcks               1056               0.0
TcpExtTCPHPAcks                 10889              0.0
TcpExtTCPSackRecovery           4                  0.0
TcpExtTCPFACKReorder            1                  0.0
TcpExtTCPDSACKUndo              2                  0.0
TcpExtTCPFastRetrans            11                 0.0
TcpExtTCPForwardRetrans         3                  0.0
TcpExtTCPTimeouts               113                0.0
TcpExtTCPLossProbes             2                  0.0
TcpExtTCPLossProbeRecovery      1                  0.0
TcpExtTCPRcvCollapsed           543                0.0
TcpExtTCPDSACKOldSent           3951               0.0
TcpExtTCPDSACKOfoSent           6                  0.0
TcpExtTCPDSACKRecv              13                 0.0
TcpExtTCPDSACKIgnoredNoUndo     1                  0.0
TcpExtTCPSackShifted            7                  0.0
TcpExtTCPSackMerged             23                 0.0
TcpExtTCPSackShiftFallback      72                 0.0
TcpExtTCPBacklogDrop            204                0.0
TcpExtTCPRcvCoalesce            44267              0.0
TcpExtTCPOFOQueue               487750             0.0
TcpExtTCPOFOMerge               6                  0.0
TcpExtTCPSpuriousRtxHostQueues  112                0.0
TcpExtTCPAutoCorking            1424               0.0
TcpExtTCPWantZeroWindowAdv      45                 0.0
TcpExtTCPSynRetrans             112                0.0
TcpExtTCPOrigDataSent           16466              0.0
IpExtInBcastPkts                710                0.0
IpExtInOctets                   3252256888         0.0
IpExtOutOctets                  57783538           0.0
IpExtInBcastOctets              75470              0.0
IpExtInNoECTPkts                2236348            0.0

Anyway, I could live with these figures, which I get between bonding
interfaces configured with balance-rr.
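For reference, the balance-rr setup between the blades boils down to
something like the following iproute2 sketch (however your distro's network
config actually expresses it; interface names and the address are just
placeholders from my setup):

# round-robin bond over the six blade NICs, with link monitoring
modprobe bonding
ip link add bond0 type bond mode balance-rr miimon 100

for i in 0 1 2 3 4 5; do
    ip link set eth$i down
    ip link set eth$i master bond0
done

ip link set bond0 up
ip addr add 192.168.130.3/24 dev bond0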
However, when I switch over to the gateway, which is connected by an 802.3ad
bonding policy link, performance sucks.

From rr to 802.3ad:

root@cruncher:/cluster/etc/scripts/available# time cp /cluster/shm/node003/random.002 /run/shm/
real    1m20.708s
user    0m0.000s
sys     0m5.812s

=> 37 MByte/s = 300 MBit/s

From 802.3ad to rr:

root@blade-002:~# time cp /cluster/shm/node000/random.002 /run/shm/random.002
real    0m26.747s
user    0m0.008s
sys     0m4.256s

=> 111 MByte/s = 888 MBit/s

> > I tried layer 2 bonding as described here (... searching for a)
> > all-linux, maybe layer 3 alternative,

So maybe I'd leave the rr in place for peer-to-peer connections between the
blades and just have a layer-3, teql-like thing to the gateway?

Hm, but can this work? balance-rr bonding syncs all MACs on the bond slaves,
so I'm afraid there is no longer a chance to mix it with assigning individual
IPs to the slave interfaces, right? But when all interfaces have the same
MAC, distribution is left to the switch, with all the opacity problems I
encountered. So either I go all-layer-2 or all-layer-3, right?

> That text in the bonding documentation is fairly old, and (...)
> It doesn't work well today, if for no other reason than
> interrupt coalescing and NAPI on the receiver will induce serious out of
> order delivery, and turning that off is not really an option.

Well, as my figures above tell me, it's not that bad, as long as it can be
configured undisturbed on both sides and matches the switch topology.

> > - How does the routing look like if I have 17 hosts connected by 6
> >   interfaces each?

As long as this question is not worked out, I have no chance to test teql on
my system, I'm afraid.

> > Current best setting is now having the blades on balancing-rr and the
> > gateway connected by 8 parallel Gbit-links to one single VC-device and
> > using LACP / 802.3ad on this.
>
> If you're testing your single stream throughput through this
> LACP aggregation, you'll be limited by the throughput of one member of
> that aggregation, as LACP will not stripe traffic.

I know. That's the reason why I would like to do round robin. What can I
expect from teql as compared to rr bonding and to LACP bonding? (I'll sketch
below what I have in mind.)

> Another issue is that, even if you round-robin from the host's
> bond, if traffic has to transit through a switch aggregation (channel
> group), it will rebalance the traffic on egress, and most likely funnel
> it all back through a single switch port.

That's obviously what happens in blade <-> gateway connections, due to the
asymmetric connection. In blade <-> blade peering it works fine, as I wrote.
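Coming back to the teql question: from what I read in the LARTC HOWTO, a
layer-3 equalizer over, say, two of the gateway links would be set up roughly
like this on both ends (device names and addresses are just placeholders; I
have not tried it yet):

modprobe sch_teql

# hang both physical links off the same TEQL master device
tc qdisc add dev eth2 root teql0
tc qdisc add dev eth3 root teql0

ip link set dev teql0 up
ip addr add 192.168.131.1/24 dev teql0      # the peer would get .2

# the HOWTO warns that return packets may come in on the "wrong" slave,
# so reverse-path filtering has to be relaxed on the slave interfaces
echo 0 > /proc/sys/net/ipv4/conf/eth2/rp_filter
echo 0 > /proc/sys/net/ipv4/conf/eth3/rp_filter

Whether this stripes a single TCP stream across the links any better than
802.3ad, without the reordering getting out of hand, is exactly what I would
like to find out.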
I'll try to draw an ASCII image of the topology:

+-+-+-+-+-+----blade-001
+-+-+-+-+-+----blade-002
+-+-+-+-+-+----blade-003
+-+-+-+-+-+----blade-004
+-+-+-+-+-+----blade-005
+-+-+-+-+-+----blade-006
+-+-+-+-+-+----blade-007
+-+-+-+-+-+----blade-008
+-+-+-+-+-+----blade-009
+-+-+-+-+-+----blade-010
+-+-+-+-+-+----blade-011
+-+-+-+-+-+----blade-012
+-+-+-+-+-+----blade-013
+-+-+-+-+-+----blade-014
+-+-+-+-+-+----blade-015
+-+-+-+-+-+----blade-016
+-------------eth2---gateway (aka cruncher)
+-------------eth3---gateway (aka cruncher)
+-------------eth4---gateway (aka cruncher)
+-------------eth5---gateway (aka cruncher)
+-------------eth6---gateway (aka cruncher)
+-------------eth7---gateway (aka cruncher)
+-------------eth8---gateway (aka cruncher)
+-------------eth9---gateway (aka cruncher)

Each "+" column is a VC switching module. The blades have eth0 ... eth5
connected in the hardwired matrix shown above. There are additional stacking
links between the VC switching modules, not shown here. But it looks like the
shortest-path algorithm keeps rr neatly ordered between blades.

However, when I distribute the gateway connections equally across all switch
modules, only one of them is "link-active"; the others are shown as
"link-failover". Only when I connect all of them to a single VC and configure
them with LACP are they used in parallel - but then they don't match the
round robin mode, just as you mention.

> one-switch-per-interface sort of arrangement that blade environments
> impose, and never really got bonding to work well for load balancing in
> those type of environments.
>
> One issue for production use was that if a switch port fails on
> one of the switches, the other peers sending traffic into that switch

Well, I think there are different goals in HIGH-PERFORMANCE clustering as
opposed to HIGH-AVAILABILITY clustering. Most "production use" refers to web
server or enterprise system stuff, which is basically HA, I'd say. And that's
what these boxes are optimised for - see the link-failover issue above.

Setting up a new HPC cluster with a bunch of dollar notes, I would presumably
just go for InfiniBand instead of Ethernet (or at least 10 GBit Ethernet),
but there is no budget for that. I simply try to get the best out of the
stuff I can pick up at the lower end of the food chain ;-)

Hm, so what? I'll try to read the HP documentation to see whether I can get
rid of the failover behaviour. If only I could rip off all the stacking links
and let the VC modules each behave as a "good old cheap and silly" switch....

Wolfgang Rosner

--
To unsubscribe from this list: send the line "unsubscribe lartc" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html