Re: performance regression on HiperSockets depending on MTU size

On Mon, 2012-11-26 at 16:32 +0100, Frank Blaschka wrote:
> Hi Eric,
> 
> since kernel 3.6 we see a massive performance regression on s390
> HiperSockets devices.
> 
> HiperSockets differ from normal devices in that they support
> large MTU sizes (up to 56K). Here are some iperf numbers showing
> how the problem depends on the MTU size:
> 
> # ifconfig hsi0 mtu 1500
> # iperf -c 10.42.49.2
> ------------------------------------------------------------
> Client connecting to 10.42.49.2, TCP port 5001
> TCP window size: 47.6 KByte (default)
> ------------------------------------------------------------
> [  3] local 10.42.49.1 port 55855 connected with 10.42.49.2 port 5001
> [ ID] Interval       Transfer     Bandwidth
> [  3]  0.0-10.0 sec    632 MBytes    530 Mbits/sec
> 
> # ifconfig hsi0 mtu 9000
> # iperf -c 10.42.49.2
> ------------------------------------------------------------
> Client connecting to 10.42.49.2, TCP port 5001
> TCP window size: 97.0 KByte (default)
> ------------------------------------------------------------
> [  3] local 10.42.49.1 port 55856 connected with 10.42.49.2 port 5001
> [ ID] Interval       Transfer     Bandwidth
> [  3]  0.0-10.0 sec  2.26 GBytes  1.94 Gbits/sec
> 
> # ifconfig hsi0 mtu 32000
> # iperf -c 10.42.49.2
> ------------------------------------------------------------
> Client connecting to 10.42.49.2, TCP port 5001
> TCP window size:   322 KByte (default)
> ------------------------------------------------------------
> [  3] local 10.42.49.1 port 55857 connected with 10.42.49.2 port 5001
> [ ID] Interval       Transfer     Bandwidth
> [  3]  0.0-10.3 sec  3.12 MBytes  2.53 Mbits/sec
> 
> Prior to the regression, throughput grew with the MTU size, but now it
> drops to a few Mbit/s if the MTU is bigger than 15000. Interestingly,
> if 2 or more connections are running in parallel, the regression is gone.
> 
> # ifconfig hsi0 mtu 32000
> # iperf -c 10.42.49.2 -P2
> ------------------------------------------------------------
> Client connecting to 10.42.49.2, TCP port 5001
> TCP window size:   322 KByte (default)
> ------------------------------------------------------------
> [  4] local 10.42.49.1 port 55869 connected with 10.42.49.2 port 5001
> [  3] local 10.42.49.1 port 55868 connected with 10.42.49.2 port 5001
> [ ID] Interval       Transfer     Bandwidth
> [  4]  0.0-10.0 sec  2.19 GBytes  1.88 Gbits/sec
> [ ID] Interval       Transfer     Bandwidth
> [  3]  0.0-10.0 sec  2.17 GBytes  1.87 Gbits/sec
> [SUM]  0.0-10.0 sec  4.36 GBytes  3.75 Gbits/sec
> 
> I bisected the problem to following patch:
> 
> commit 46d3ceabd8d98ed0ad10f20c595ca784e34786c5
> Author: Eric Dumazet <eric.dumazet@xxxxxxxxx>
> Date:   Wed Jul 11 05:50:31 2012 +0000
> 
>     tcp: TCP Small Queues
> 
>     This introduce TSQ (TCP Small Queues)
> 
>     TSQ goal is to reduce number of TCP packets in xmit queues (qdisc &
>     device queues), to reduce RTT and cwnd bias, part of the bufferbloat
>     problem.
> 
> Changing sysctl net.ipv4.tcp_limit_output_bytes to a higher value
> (e.g. 640000) seems to fix the problem.
> 
> How does the MTU influence/affect TSQ?
> Why is the problem gone if there are more connections?
> Do you see any drawbacks to increasing net.ipv4.tcp_limit_output_bytes?
> Finally, is this expected behavior, or is there a bug related to the big
> MTU? What can I do to check ... ?
> 

Hi Frank, thanks for this report.

You could tweak tcp_limit_output_bytes, but IMO the root of the problem
is in the driver itself.
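
For reference, the workaround you already found amounts to something
like the command below; 640000 is just the value from your report, not
a tuned recommendation.

# sysctl -w net.ipv4.tcp_limit_output_bytes=640000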

For example, I had to change the mlx4 driver for the same problem: make
sure a TX packet can be "TX completed" in a short amount of time.

In the case of mlx4, the wait time was 128 us, but I suspect in your
case it's more like an infinite time or several ms.
 
The driver is delaying the freeing of TX skbs by a fixed amount of time,
or relies on subsequent transmits to perform the TX completion.

See this commit for an example:

commit ecfd2ce1a9d5e6376ff5c00b366345160abdbbb7
Author: Eric Dumazet <edumazet@xxxxxxxxxx>
Date:   Mon Nov 5 16:20:42 2012 +0000

    mlx4: change TX coalescing defaults
    
    mlx4 currently uses a too high tx coalescing setting, deferring
    TX completion interrupts by up to 128 us.
    
    With the recent skb_orphan() removal in commit 8112ec3b872,
    performance of a single TCP flow is capped to ~4 Gbps, unless
    we increase tcp_limit_output_bytes.
    
    I suggest using 16 us instead of 128 us, allowing a finer control.
    
    Performance of a single TCP flow is restored to previous levels,
    while keeping TCP small queues fully enabled with default sysctl.
    
    This patch is also a BQL prereq.
    
    Reported-by: Vimalkumar <j.vimal@xxxxxxxxx>
    Signed-off-by: Eric Dumazet <edumazet@xxxxxxxxxx>


