On Wed, 17 Dec 2008, slashdev wrote:

> Hi,
>
> I am experimenting with LFN using netem. My setup consists of two
> linux (sles10/2.6.16/64-bit) boxes connected via a GigE link (no
> switch in between) to another linux box (sles10/64-bit) that acts as
> a router. I add delay (and other link characteristics) via netem on
> the router. Something like this:
>
> [ sender ] ---- gige ---- [ router ] ----- gige ---- [ receiver ]
>  192.1.1.2           192.1.1.1  192.1.2.1            192.1.2.2
>                         eth1       eth2
>
> sender and receiver have Broadcom/NetX II/GigE NICs, and the router
> has e1000 NICs. TSO is turned on on all NICs.
>
> I insert equal delay using netem on the router on both the eth1 and
> eth2 interfaces. So to test with 100ms RTT, I add 50ms delay to
> outgoing traffic on both eth1 and eth2.
>
> If I remove netem altogether, the RTT is around 0.1 ms (LAN times)
> between sender and receiver and I get a healthy ~110 MBps throughput.
> tcpdump trace: http://slashdev.googlepages.com/trace.nuttcp.0ms.gz
>
> If I add 20ms RTT, the TCP throughput is still around 95 Mbps (good).
>
> If I add 100ms RTT, the TCP throughput crawls to 230 KBps :-(
> tcpdump trace: http://slashdev.googlepages.com/trace.nuttcp.100ms.gz
>
> Note: the trace was collected from "eth2" on the router.
>
> I've tuned the TCP buffers (tcp_r|wmem) to have 15MB as the max value
> (to account for the large BDP) on both sender and receiver, and
> turned on the usual performance tunables (timestamps, SACKs, etc.).
> My app (nuttcp) does not set socket buffer sizes explicitly (verified
> via strace(8)), so auto-tuning is active. And I've set the congestion
> control algorithm to "cubic" (tried "bic" and "reno" with similar
> results).
>
> Given all this, I am not able to deduce the cause of the poor TCP
> performance with 100ms RTT. tcptrace (tcptrace.org) for the 100ms
> dump shows a lot of out-of-order packets. Is that the reason? Or is
> it something else?
>
> Any hints and advice on fixing my setup? Or anything I need to
> check/fix on the router? I'd like to see Linux TCP fill the whole
> 100ms/1Gbps link :-)
>
> If you need any more details please let me know.
>
> Thanks in advance,

Did you increase the tc netem limit parameter? For example, in testing
I have done emulating an 80 ms RTT path using a 10-GigE network, I use
a limit setting of 20000.

Did you try explicitly setting the window size (the "-w8m" nuttcp
option for a 100 ms RTT GigE path, making sure the net.core.{r|w}mem
sysctl parameters have been appropriately increased)? I've sketched
concrete commands for your setup below.

In my testing, I have also put all the delay on one interface rather
than splitting it across two interfaces. I don't know if that would
make any difference or not.

One oddity I have noticed is that using too large a window size with
netem can be counterproductive. For example, using the following
command to emulate an 80 ms RTT path:

tc qdisc add dev eth3 root handle 10: netem delay 80ms limit 20000

[root@lang2 ~]# nuttcp -xt 192.168.89.15
traceroute to 192.168.89.15 (192.168.89.15), 30 hops max, 40 byte packets
 1  192.168.88.13 (192.168.88.13)  0.149 ms  0.122 ms  0.154 ms
 2  192.168.89.15 (192.168.89.15)  82.242 ms  82.151 ms  82.671 ms

traceroute to 192.168.88.14 (192.168.88.14), 30 hops max, 40 byte packets
 1  192.168.89.13 (192.168.89.13)  81.105 ms  81.678 ms  81.680 ms
 2  192.168.88.14 (192.168.88.14)  81.676 ms  81.739 ms  81.739 ms

Using a window size of 80 MB is optimal:

[root@lang2 ~]# nuttcp -w80m 192.168.89.15
 8318.2500 MB / 10.10 sec = 6908.7737 Mbps 90 %TX 53 %RX 0 retrans 82.52 msRTT

No TCP retransmissions and almost 7 Gbps of TCP throughput.
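As an aside, to make the limit and window suggestions above concrete
for your 1 Gbps / 100 ms setup, this is roughly what I'd try (a rough,
untested sketch: the interface names and the receiver address are
taken from your diagram, and the limit and buffer values are just
ballpark figures from the arithmetic in the comments):

# On the router: netem's queue limit must cover the path's BDP in
# packets, or netem itself starts dropping packets once the pipe
# fills.  1 Gbps * 0.1 s = 12.5 MB ~= 8300 full-size 1500-byte
# packets, so 10000 per interface leaves some headroom.
tc qdisc add dev eth1 root handle 10: netem delay 50ms limit 10000
tc qdisc add dev eth2 root handle 10: netem delay 50ms limit 10000

# On sender and receiver: raise the core limits so an explicitly
# requested window isn't silently clamped.
sysctl -w net.core.rmem_max=16777216
sysctl -w net.core.wmem_max=16777216

# Then, from the sender:
nuttcp -w8m 192.1.2.2

(Or put the full 100ms on one interface, as I do in my own testing.)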
Returning to my 10-GigE test: if I instead increase the window size
to 100 MB:

[root@lang2 ~]# nuttcp -w100m 192.168.89.15
 5604.1875 MB / 10.13 sec = 4641.7343 Mbps 77 %TX 38 %RX 461 retrans 83.54 msRTT

Many TCP retransmissions, and the TCP throughput drops to less than
5 Gbps. And here's the autotuning case:

[root@lang2 ~]# nuttcp 192.168.89.15
 7324.8125 MB / 10.10 sec = 6086.0929 Mbps 79 %TX 47 %RX 0 retrans 84.07 msRTT

Very good, with no TCP retransmissions, but not quite as good as the
explicit 80 MB TCP window case. I'm not sure yet why overspecifying
the TCP window size has this negative performance impact with netem.

Also note that Linux effectively gives you 50% more available TCP
window size than what you explicitly request with nuttcp's "-w"
option, which is why I specified "-w8m" earlier for the 100 ms RTT
GigE path, when the actual BW*RTT calculation yields a required TCP
window size of about 12 MB (1 Gbps * 0.1 s = 12.5 MB, and
8 MB + 50% = 12 MB).

						-Bill
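P.S. If you want to double-check what window the kernel actually
grants for a given "-w" request, one rough way is to inspect the
socket while a transfer is running (a sketch only; "ss" ships with
iproute2, and the exact fields in its output vary by version):

# On the sender, while nuttcp is running: -t selects TCP sockets,
# -m shows socket memory (rb/tb in the skmem: field are the granted
# receive/transmit buffer limits), -i shows TCP info such as cwnd.
ss -tmi dst 192.1.2.2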