On Wed, Dec 17, 2008 at 11:04 PM, Bill Fink <billfink@xxxxxxxxxxxxxx> wrote:
> On Wed, 17 Dec 2008, slashdev wrote:
>
>> Hi,
>>
>> I am experimenting with LFN using netem. My setup consists of two
>> linux (sles10/2.6.16/64-bit) boxes connected via a GigE link (no
>> switch in between) to another linux box (sles10/64-bit) that acts as
>> a router. I add delay (and other link characteristics) via netem on
>> the router. Something like this:
>>
>> [ sender ] ---- gige ---- [ router ] ----- gige ---- [ receiver ]
>>  192.1.1.2             1.1.1    1.2.1                 192.1.2.2
>>                         eth1     eth2
>>
>> The sender and receiver have Broadcom/NetX II/GigE NICs, and the
>> router has e1000 NICs. TSO is turned on on all NICs.
>>
>> I insert equal delay using netem on the router on both the eth1 and
>> eth2 interfaces. So to test with 100ms RTT, I add 50ms delay to
>> outgoing traffic on both eth1 and eth2.
>>
>> If I remove netem altogether, the RTT is around 0.1 ms (LAN times)
>> between sender and receiver and I get a healthy ~110MBps throughput.
>> tcpdump trace: http://slashdev.googlepages.com/trace.nuttcp.0ms.gz
>>
>> If I add 20ms RTT, the TCP throughput is still around 95Mbps (good).
>>
>> If I add 100ms RTT, the TCP throughput crawls to 230KBps :-(
>> tcpdump trace: http://slashdev.googlepages.com/trace.nuttcp.100ms.gz
>>
>> Note: the trace was collected from "eth2" on the router.
>>
>> I've tuned the TCP buffers (tcp_r|wmem) to have 15MB as the max
>> value (to account for the large BDP) on both sender and receiver,
>> and turned on the usual performance tunables (timestamps, SACKs
>> etc.). My app (nuttcp) does not set socket buffer sizes explicitly
>> (verified via strace(8)), so auto-tuning is active. And I've set the
>> congestion control algorithm to "cubic" (tried "bic" and "reno" with
>> similar results).
>>
>> Given all this, I am not able to deduce the cause of the poor TCP
>> performance with 100ms RTT. tcptrace (tcptrace.org) for the 100ms
>> dump shows a lot of ooo packets. Is that the reason? Or is it
>> something else?
>>
>> Any hints and advice on fixing my setup? Or anything I need to
>> check/fix on the router? I'd like to see Linux TCP fill the whole
>> 100ms/1Gbps link :-)
>>
>> If you need any more details, please let me know.
>>
>> Thanks in advance,
>
> Did you increase the tc netem limit parameter?  For example, in testing
> I have done emulating an 80 ms RTT path using a 10-GigE network, I use
> a limit setting of 20000.

I set it to 100000 :-)

Current status on my router (post tests):
----------------
router$ tc -s qdisc
[snip]
qdisc netem 8008: dev eth1 limit 100000 delay 50.0ms  10.0ms 25%
 Sent 3489810 bytes 45887 pkt (dropped 0, overlimits 0 requeues 0)
 rate 0bit 0pps backlog 0b 0p requeues 0
qdisc netem 8009: dev eth2 limit 100000 delay 50.0ms  10.0ms 25%
 Sent 68982220 bytes 51938 pkt (dropped 0, overlimits 0 requeues 0)
 rate 0bit 0pps backlog 0b 0p requeues 0
----------------

No dropped packets. No requeues. I know 100000 is too high, but things
(throughput-wise) did not change much when the limit was 1000 (the
default).
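For completeness, the netem setup on the router was done with commands
along these lines (reconstructed from the qdisc dump above; the actual
handles will differ):

----------------
# router: 50ms delay, 10ms jitter with 25% correlation, on each side
# (note: jitter by itself can reorder packets, since later packets may
# be assigned shorter delays -- possibly relevant to the ooo packets
# tcptrace reported above)
tc qdisc add dev eth1 root netem limit 100000 delay 50ms 10ms 25%
tc qdisc add dev eth2 root netem limit 100000 delay 50ms 10ms 25%
----------------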
> Did you try explicitly setting the window size ("-w8m" nuttcp option
> for 100 ms RTT GigE path, and make sure the net.core.{r|w}mem sysctl
> parameters have been appropriately increased)?

No, I did not try that. I thought buffer size auto-tuning would take
care of this (assuming my tcp_r|wmem values account for the large BDP).
Will try setting the window size explicitly.

> In my testing, I have also put all the delay on one interface rather
> than splitting it across two interfaces.  I don't know if that would
> make any difference or not.
>
> One oddity I have noticed is that using too large a window size with
> netem can be counterproductive.  For example, using the following
> command to emulate an 80 ms RTT path:
>
> tc qdisc add dev eth3 root handle 10: netem delay 80ms limit 20000
>
> [root@lang2 ~]# nuttcp -xt 192.168.89.15
> traceroute to 192.168.89.15 (192.168.89.15), 30 hops max, 40 byte packets
>  1  192.168.88.13 (192.168.88.13)  0.149 ms  0.122 ms  0.154 ms
>  2  192.168.89.15 (192.168.89.15)  82.242 ms  82.151 ms  82.671 ms
>
> traceroute to 192.168.88.14 (192.168.88.14), 30 hops max, 40 byte packets
>  1  192.168.89.13 (192.168.89.13)  81.105 ms  81.678 ms  81.680 ms
>  2  192.168.88.14 (192.168.88.14)  81.676 ms  81.739 ms  81.739 ms
>
> Using a window size of 80 MB is optimal:
>
> [root@lang2 ~]# nuttcp -w80m 192.168.89.15
>  8318.2500 MB / 10.10 sec = 6908.7737 Mbps 90 %TX 53 %RX 0 retrans 82.52 msRTT
>
> No TCP retransmissions and almost 7 Gbps of TCP throughput.
>
> But if I increase the window size to 100 MB:
>
> [root@lang2 ~]# nuttcp -w100m 192.168.89.15
>  5604.1875 MB / 10.13 sec = 4641.7343 Mbps 77 %TX 38 %RX 461 retrans 83.54 msRTT
>
> Many TCP retransmissions, and the TCP throughput drops to less than 5 Gbps.
>
> And here's the autotuning case:
>
> [root@lang2 ~]# nuttcp 192.168.89.15
>  7324.8125 MB / 10.10 sec = 6086.0929 Mbps 79 %TX 47 %RX 0 retrans 84.07 msRTT
>
> Very good, with no TCP retransmissions, but not quite as optimal as
> the explicit 80 MB TCP window case.
>
> I'm not sure yet why overspecifying the TCP window size has this negative
> performance impact with netem.  Also note that Linux effectively gives
> you 50% more available TCP window size than what you explicitly request
> with nuttcp's "-w" option, which is why I specified "-w8m" earlier for
> the 100 ms RTT GigE path, when the actual BW*RTT calculation yields a
> required TCP window size of about 12 MB.

Thanks for your insight. Will try setting window sizes explicitly and
report back. But given that you do get decent performance with
auto-tuning, I think the cause of the very bad performance at my end is
something else. (I am getting ~200KBps with 100ms RTT :-)

Is your setup similar to mine? Or are you using netem on the
sender/receiver node itself (with both of them connected back-to-back)
and no router in between?

Thanks again!

S.
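P.S. For anyone following along, the window arithmetic for my path:
BDP = 1 Gbps x 100 ms = 125 MB/s x 0.1 s = 12.5 MB, and per Bill's note
that Linux grants ~50% more window than requested, "-w8m" comes out to
about 12 MB effective. Something like this is what I plan to run (just a
sketch; the sysctl values mirror the 15MB tcp_r|wmem max from my setup):

----------------
# sender and receiver: make net.core limits match the 15MB tcp_r|wmem max,
# per Bill's advice to increase net.core.{r|w}mem appropriately
sysctl -w net.core.rmem_max=15728640
sysctl -w net.core.wmem_max=15728640

# explicit window test per Bill's suggestion; -w8m ends up ~12MB of
# effective window on Linux, close to the 12.5MB BDP of a 1Gbps/100ms path
nuttcp -w8m 192.1.2.2
----------------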