On Wed, 17 Dec 2008, slashdev wrote:

> Hi,
>
> I am experimenting with LFN using netem. My setup consists of two
> linux (sles10/2.6.16/64-bit) boxes connected via a GigE link (no
> switch in between) to another linux box (sles10/64-bit) that acts as
> a router. I add delay (and other link characteristics) via netem on
> the router. Something like this:
>
> [ sender ] ---- gige ---- [ router ] ----- gige ---- [ receiver ]
>  192.1.1.2           192.1.1.1  192.1.2.1            192.1.2.2
>                         eth1       eth2
>
> sender and receiver have Broadcom/NetX II/GigE NICs, and the router
> has e1000 NICs. TSO is turned on on all NICs.
>
> I insert equal delay using netem on the router on both the eth1 and
> eth2 interfaces. So to test with 100ms RTT, I add 50ms delay to
> outgoing traffic on both eth1 and eth2.
>
> If I remove netem altogether, the RTT is around 0.1 ms (LAN times)
> between sender and receiver and I get a healthy ~110 MBps throughput.
> tcpdump trace: http://slashdev.googlepages.com/trace.nuttcp.0ms.gz
>
> If I add 20ms RTT, the TCP throughput is still around 95 Mbps (good).
>
> If I add 100ms RTT, the TCP throughput crawls to 230 KBps :-(
> tcpdump trace: http://slashdev.googlepages.com/trace.nuttcp.100ms.gz
>
> Note: the trace was collected from "eth2" on the router.
>
> I've tuned the TCP buffers (tcp_r|wmem) to have 15MB as the max value
> (to account for the large BDP) on both sender and receiver, and
> turned on the usual performance tunables (timestamps, SACKs, etc.).
> My app (nuttcp) does not set socket buffer sizes explicitly (verified
> via strace(8)), so auto-tuning is active. And I've set the congestion
> control algorithm to "cubic" (tried "bic" and "reno" with similar
> results).
>
> Given all this, I am not able to deduce the cause of the poor TCP
> performance with 100ms RTT. tcptrace (tcptrace.org) for the 100ms
> dump shows a lot of out-of-order packets. Is that the reason? Or is
> it something else?
>
> Any hints and advice on fixing my setup? Or anything I need to
> check/fix on the router? I'd like to see Linux TCP fill the whole
> 100ms/1Gbps link :-)
>
> If you need any more details please let me know.
>
> Thanks in advance,

Did you increase the tc netem limit parameter? For example, in testing
I have done emulating an 80 ms RTT path using a 10-GigE network, I use
a limit setting of 20000.

Did you try explicitly setting the window size (the "-w8m" nuttcp
option for a 100 ms RTT GigE path, making sure the net.core.{r|w}mem
sysctl parameters have been appropriately increased)? I've sketched
concrete commands for your setup below.

In my testing, I have also put all the delay on one interface rather
than splitting it across two interfaces. I don't know if that would
make any difference or not.

One oddity I have noticed is that using too large a window size with
netem can be counterproductive. For example, using the following
command to emulate an 80 ms RTT path:

tc qdisc add dev eth3 root handle 10: netem delay 80ms limit 20000

[root@lang2 ~]# nuttcp -xt 192.168.89.15
traceroute to 192.168.89.15 (192.168.89.15), 30 hops max, 40 byte packets
 1  192.168.88.13 (192.168.88.13)  0.149 ms  0.122 ms  0.154 ms
 2  192.168.89.15 (192.168.89.15)  82.242 ms  82.151 ms  82.671 ms

traceroute to 192.168.88.14 (192.168.88.14), 30 hops max, 40 byte packets
 1  192.168.89.13 (192.168.89.13)  81.105 ms  81.678 ms  81.680 ms
 2  192.168.88.14 (192.168.88.14)  81.676 ms  81.739 ms  81.739 ms

Using a window size of 80 MB is optimal:

[root@lang2 ~]# nuttcp -w80m 192.168.89.15
 8318.2500 MB / 10.10 sec = 6908.7737 Mbps 90 %TX 53 %RX 0 retrans 82.52 msRTT

No TCP retransmissions and almost 7 Gbps of TCP throughput.
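As an aside, to make the limit and window suggestions above concrete
for your 1 Gbps / 100 ms setup, this is roughly what I'd try (a rough,
untested sketch: the interface names and the receiver address are
taken from your diagram, and the limit and buffer values are just
ballpark figures from the arithmetic in the comments):

# On the router: netem's queue limit must cover the path's BDP in
# packets, or netem itself starts dropping packets once the pipe
# fills.  1 Gbps * 0.1 s = 12.5 MB ~= 8300 full-size 1500-byte
# packets, so 10000 per interface leaves some headroom.
tc qdisc add dev eth1 root handle 10: netem delay 50ms limit 10000
tc qdisc add dev eth2 root handle 10: netem delay 50ms limit 10000

# On sender and receiver: raise the core limits so an explicitly
# requested window isn't silently clamped.
sysctl -w net.core.rmem_max=16777216
sysctl -w net.core.wmem_max=16777216

# Then, from the sender:
nuttcp -w8m 192.1.2.2

(Or put the full 100ms on one interface, as I do in my own testing.)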
Returning to my 10-GigE test: if I instead increase the window size
to 100 MB:

[root@lang2 ~]# nuttcp -w100m 192.168.89.15
 5604.1875 MB / 10.13 sec = 4641.7343 Mbps 77 %TX 38 %RX 461 retrans 83.54 msRTT

Many TCP retransmissions, and the TCP throughput drops to less than
5 Gbps. And here's the autotuning case:

[root@lang2 ~]# nuttcp 192.168.89.15
 7324.8125 MB / 10.10 sec = 6086.0929 Mbps 79 %TX 47 %RX 0 retrans 84.07 msRTT

Very good, with no TCP retransmissions, but not quite as good as the
explicit 80 MB TCP window case. I'm not sure yet why overspecifying
the TCP window size has this negative performance impact with netem.

Also note that Linux effectively gives you 50% more available TCP
window size than what you explicitly request with nuttcp's "-w"
option, which is why I specified "-w8m" earlier for the 100 ms RTT
GigE path, when the actual BW*RTT calculation yields a required TCP
window size of about 12 MB (1 Gbps * 0.1 s = 12.5 MB, and
8 MB + 50% = 12 MB).

						-Bill
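P.S. If you want to double-check what window the kernel actually
grants for a given "-w" request, one rough way is to inspect the
socket while a transfer is running (a sketch only; "ss" ships with
iproute2, and the exact fields in its output vary by version):

# On the sender, while nuttcp is running: -t selects TCP sockets,
# -m shows socket memory (rb/tb in the skmem: field are the granted
# receive/transmit buffer limits), -i shows TCP info such as cwnd.
ss -tmi dst 192.1.2.2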