Re: very low tcp thruput on LFN (using netem)

On Thu, 18 Dec 2008, slashdev wrote:

> On Thu, Dec 18, 2008 at 12:28 AM, slashdev <slashdev@xxxxxxxxx> wrote:
> > On Wed, Dec 17, 2008 at 11:04 PM, Bill Fink <billfink@xxxxxxxxxxxxxx> wrote:
> >> On Wed, 17 Dec 2008, slashdev wrote:
> >>
> >>> Hi,
> >>>
> >>> I am experimenting with LFN using netem. My setup consists of two
> >>> linux (sles10/2.6.16/64-bit) boxes connected via GigE link (no switch
> >>> in between) to another linux box (sles10/64-bit) that acts as a
> >>> router. I add delay (and other link characteristics) via netem on the
> >>> router. Something like this:
> >>>
> >>> [ sender ] ---- gige ---- [   router   ] ----- gige ---- [ receiver ]
> >>> 192.1.1.2                1.1.1         1.2.1                 192.1.2.2
> >>>                                eth1         eth2
> >>>
> >>> Sender and receiver have Broadcom/NetX II/GigE NICs, and the router
> >>> has e1000 NICs. TSO is turned on on all NICs.

To answer an earlier question, I am using the same basic type of setup,
with netem running on an intermediate router box between the sender and
receiver, just with 10-GigE rather than GigE.  One difference is I'm
testing with 9000-byte jumbo frames whereas I'm assuming you are testing
with standard 1500-byte Ethernet packets.  Another difference is I'm
running a newer 2.6.20.7 kernel while you're running a 2.6.16 kernel.

I just ran some tests using standard 1500-byte Ethernet packets and
found that it makes a big difference.  Using an 80 MB TCP window on an
emulated 80 ms RTT path:

[root@lang2 ~]# nuttcp -M1460 -w80m 192.168.89.15
  169.1862 MB /  24.58 sec =   57.7396 Mbps 2 %TX 1 %RX 2286 retrans 84.67 msRTT

Over 2000 TCP retransmissions and only 57 Mbps of TCP throughput.

Dropping down to an 8 MB TCP window:

[root@lang2 ~]# nuttcp -M1460 -w8m 192.168.89.15
  702.2605 MB /  10.07 sec =  585.1227 Mbps 15 %TX 12 %RX 0 retrans 84.05 msRTT

No TCP retransmissions and TCP throughput up to 585 Mbps.

Upping it to a 9 MB TCP window:

[root@lang2 ~]# nuttcp -M1460 -w9m 192.168.89.15
  966.6250 MB /  10.15 sec =  799.0430 Mbps 22 %TX 24 %RX 0 retrans 81.67 msRTT

Still no TCP retransmissions and TCP throughput up to nearly 800 Mbps.

But upping to a 10 MB TCP window:

[root@lang2 ~]# nuttcp -M1460 -w10m 192.168.89.15
  550.3750 MB /  10.17 sec =  453.8840 Mbps 9 %TX 6 %RX 37 retrans 84.53 msRTT

There are now some TCP retransmissions and TCP throughput drops to
about 450 Mbps.

Trying TCP autotuning:

[root@lang2 ~]# nuttcp -M1460 192.168.89.15
  163.5276 MB /  23.12 sec =   59.3327 Mbps 3 %TX 2 %RX 1582 retrans 81.26 msRTT

Over 1500 TCP retransmissions and TCP throughput drops to only 59 Mbps.

So for standard 1500-byte Ethernet packets, the TCP autotuning doesn't
appear to work too well with netem.  Perhaps there are other tc/netem
parameters that could be adjusted to improve things.
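
For example, one knob worth experimenting with is the netem limit (the
qdisc's queue length in packets), re-checking the qdisc statistics for
drops afterwards.  A minimal sketch, using eth3 and the numbers from my
own setup (adjust the interface and values for yours):

	tc qdisc change dev eth3 root netem delay 80ms limit 20000
	tc -s qdisc show dev eth3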

It's possible the way I have the CPU affinities set for the NICs in the
netem/router box could also have an effect.  I have the interrupts for
eth2 forced to CPU0 and the interrupts for eth3 forced to CPU1, to spread
the load, but for this case it might be best to force both interrupts to
CPU0, to get the benefits of CPU caching, assuming the CPU doesn't become
saturated.
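
If you want to experiment with that on your router, the usual mechanism
is the per-IRQ smp_affinity mask.  A rough sketch (the IRQ numbers are
system specific, so look them up in /proc/interrupts first, and stop
irqbalance if it is running so it doesn't override the settings):

	grep eth /proc/interrupts
	echo 1 > /proc/irq/<eth1_irq>/smp_affinity   # pin eth1's interrupt to CPU0
	echo 1 > /proc/irq/<eth2_irq>/smp_affinity   # pin eth2's interrupt to CPU0 too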

> >>> I insert equal delay using netem on the router to both eth1 and eth2
> >>> interfaces. So to test with 100ms RTT, I add 50ms delay to outgoing
> >>> traffic on both eth1 and eth2.
> >>>
> >>> If I remove "netem" altogether, the RTT is around 0.1 ms (LAN times)
> >>> between sender and receiver and I get a healthy ~110MBps thruput.
> >>>   tcpdump trace: http://slashdev.googlepages.com/trace.nuttcp.0ms.gz
> >>>
> >>> If I add 20ms RTT, the tcp thruput is still around 95Mbps (good).
> >>>
> >>> If I add 100ms RTT, the tcp thruput crawls to 230KBps :-(
> >>>   tcpdump trace: http://slashdev.googlepages.com/trace.nuttcp.100ms.gz
> >>>
> >>> Note: the trace was collected from "eth2" on the router.
> >>>
> >>> I've tuned the tcp buffers (tcp_r|wmem) to have 15MB as max value (to
> >>> account for large BDP) on both sender and receiver. And turned on
> >>> usual performance tunables (timestamps, sacks etc.). My app (nuttcp)
> >>> does not set socket buffer sizes explicitly (verified via strace(8)),
> >>> so auto-tuning is active. And I've set congestion control algorithm to
> >>> "cubic" (tried "bic" and "reno" with similar results).
> >>>
> >>> Given all this, I am not able to deduce the cause of poor TCP
> >>> performance with 100ms RTT. tcptrace (tcptrace.org) for the 100ms dump
> >>> shows a lot of ooo packets. Is that the reason? Or is it something else?
> >>>
> >>> Any hints and advice on fixing my setup? Or anything I need to
> >>> check/fix on the router? I'd like to see linux tcp fill the whole
> >>> 100ms/1Gbps link :-)
> >>>
> >>> If you need any more details, please let me know.
> >>>
> >>> Thanks in advance,
> >>
> >> Did you increase the tc netem limit parameter?  For example, in testing
> >> I have done emulating an 80 ms RTT path using a 10-GigE network, I use
> >> a limit setting of 20000.
> >
> > I set it to 100000 :-) current status on my router (post tests):
> > ----------------
> > router$ tc -s qdisc
> > [snip]
> > qdisc netem 8008: dev eth1 limit 100000 delay 50.0ms  10.0ms 25%
> >  Sent 3489810 bytes 45887 pkt (dropped 0, overlimits 0 requeues 0)
> >  rate 0bit 0pps backlog 0b 0p requeues 0
> > qdisc netem 8009: dev eth2 limit 100000 delay 50.0ms  10.0ms 25%
> >  Sent 68982220 bytes 51938 pkt (dropped 0, overlimits 0 requeues 0)
> >  rate 0bit 0pps backlog 0b 0p requeues 0
> > -----------------
> >
> > No dropped packets. No requeues. I know 100000 is too high, but things
> > (thruput wise) did not change much when the limit was 1000 (the
> > default).
> >
> >> Did you try explicitly setting the window size ("-w8m" nuttcp option
> >> for 100 ms RTT GigE path, and make sure the net.core.{r|w}mem sysctl
> >> parameters have been appropriately increased)?
> >
> > No, I did not try that. I thought buffer size auto-tuning would take
> > care of this (assuming my tcp_r|wmem values account for the large
> > BDP). Will try setting window size explicitly.
> 
> This did not help. Tried setting window values between 8MB to 13MB
> (BDP is 12.5MB) but no luck :-(
> 
> >> In my testing, I have also put all the delay on one interface rather
> >> than splitting it across two interfaces.  I don't know if that would
> >> make any difference or not.
> 
> THIS made a difference. The thruput jumped from 200KBps to 1MBps !! --
> nowhere near to what i'd like to see (i.e. more than 60MBps), but
> still an improvement :-)
> 
> I put all the delay on one interface -- causing the flow from sender
> -> receiver to be delayed by 100ms but no delay from the receiver back
> to the sender. Is the improvement because ACKs come back faster? But
> the RTT is still 100ms, no?

From the point of view of the TCP transmitter, it still sees the ACK
come back after 100 ms.  I was conjecturing that maybe using netem on
only a single interface might reduce or eliminate the reordered packets
you said you saw previously.

> Either way, the thruput is still very very low (1MBps max) :-( Any
> more ideas to get to multi-digit thruput numbers? :-)

That is very very low.  On your netem/router box, have you increased
the net.core.netdev_max_backlog and ifconfig txqueuelen parameters
(3000 and 1000 respectively for GigE is probably reasonable), and/or
adjusted the NIC's TX/RX interrupt coalescing parameters?  Another
thing to check is the NIC's flow control setting, which requires some
experimentation; sometimes I have seen better performance with flow
control disabled, and other times with it enabled (it is currently
enabled on my 10-GigE NICs).
I should also note that I have TSO disabled on all my NICs as it has
caused me lots of problems in the past (but does appear to be getting
better with newer kernels).
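
As a concrete starting point, something along the lines of the commands
below on the netem/router box (the interface names and values are only
examples to adapt to your hardware, and the TSO line applies to the
sender and receiver NICs):

	sysctl -w net.core.netdev_max_backlog=3000
	ifconfig eth1 txqueuelen 1000
	ifconfig eth2 txqueuelen 1000
	ethtool -c eth1                # show current interrupt coalescing settings
	ethtool -C eth1 rx-usecs 100   # example coalescing adjustment
	ethtool -A eth1 rx off tx off  # flow control off ("rx on tx on" to re-enable)
	ethtool -K eth1 tso off        # disable TSO (on the end hosts)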

To check if there are any driver-specific issues, check your NIC
statistics with "ethtool -S ethX".
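
The exact counter names vary by driver, but on the e1000 the ones to
watch first are the receive drop counters (something along the lines of
rx_no_buffer_count and rx_missed_errors), e.g.:

	ethtool -S eth2 | grep -E 'err|drop|miss|no_buffer'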

It would probably be useful to also check what the highest UDP
data rate you can achieve without packet drops in each direction,
using the commands:

	xmit:	nuttcp -u -Ri###m -l1472 -w10m server_IP
	rcv:	nuttcp -u -r -Ri###m -l1472 -w10m server_IP

This will generate full 1500-byte Ethernet packets (1472 bytes of UDP
payload plus 8 bytes of UDP header and 20 bytes of IP header) at a data
rate of ### Mbps.  Note the "-Ri" option indicates instantaneous rate
limiting (see the nuttcp man page for details), which is what should
be used for this type of testing.  Also note the "-w10m" option,
since high data rate UDP transfers require additional kernel socket
buffering, provided via the "-w" option.

An example:

[root@lang2 ~]# nuttcp -u -Ri1g -l1472 -w10m 192.168.89.15
 1192.4512 MB /  10.01 sec =  999.6380 Mbps 99 %TX 9 %RX 0 / 849440 drop/pkt 0.00 %loss

> Thanks!
> S.

						-You're welcome

						-Bill



> >> One oddity I have noticed is that using too large a window size with
> >> netem can be counterproductive.  For example, using the following
> >> command to emulate an 80 ms RTT path:
> >>
> >>        tc qdisc add dev eth3 root handle 10: netem delay 80ms limit 20000
> >>
> >> [root@lang2 ~]# nuttcp -xt 192.168.89.15
> >> traceroute to 192.168.89.15 (192.168.89.15), 30 hops max, 40 byte packets
> >>  1  192.168.88.13 (192.168.88.13)  0.149 ms   0.122 ms   0.154 ms
> >>  2  192.168.89.15 (192.168.89.15)  82.242 ms   82.151 ms   82.671 ms
> >>
> >> traceroute to 192.168.88.14 (192.168.88.14), 30 hops max, 40 byte packets
> >>  1  192.168.89.13 (192.168.89.13)  81.105 ms   81.678 ms   81.680 ms
> >>  2  192.168.88.14 (192.168.88.14)  81.676 ms   81.739 ms   81.739 ms
> >>
> >> Using a window size of 80 MB is optimal:
> >>
> >> [root@lang2 ~]# nuttcp -w80m 192.168.89.15
> >>  8318.2500 MB /  10.10 sec = 6908.7737 Mbps 90 %TX 53 %RX 0 retrans 82.52 msRTT
> >>
> >> No TCP retransmissions and almost 7 Gbps of TCP throughput.
> >>
> >> But if I increase the window size to 100 MB:
> >>
> >> [root@lang2 ~]# nuttcp -w100m 192.168.89.15
> >>  5604.1875 MB /  10.13 sec = 4641.7343 Mbps 77 %TX 38 %RX 461 retrans 83.54 msRTT
> >>
> >> Many TCP retransmissions and the TCP throughput drops to less than 5 Gbps.
> >>
> >> And here's the autotuning case:
> >>
> >> [root@lang2 ~]# nuttcp 192.168.89.15
> >>  7324.8125 MB /  10.10 sec = 6086.0929 Mbps 79 %TX 47 %RX 0 retrans 84.07 msRTT
> >>
> >> Very good, with no TCP retransmissions, but not quite as optimal as
> >> the explicit 80 MB TCP window case.
> >>
> >> I'm not sure yet why overspecifying the TCP window size has this negative
> >> performance impact with netem.  Also note that Linux effectively gives
> >> you 50% more available TCP window size than what you explicitly request
> >> with nuttcp's "-w" option, which is why I specified "-w8m" earlier for
> >> the 100 ms RTT GigE path, when the actual BW*RTT calculation yields a
> >> required TCP window size of about 12 MB.
> >
> > Thanks for your insight. Will try setting window sizes explicitly and
> > report back. But given that you do get decent performance with
> > auto-tuning, I think the cause for very bad performance at my end is
> > something else. (I am getting ~200KBps with 100ms RTT :-)
> >
> > Is your setup similar to mine? Or are you using netem on the
> > sender/receiver node itself (with both of them connected back-2-back)
> > and no router in between?
> >
> > Thanks again!
> > S.