On 01/29/15 14:14, Eric Dumazet wrote:
On Thu, 2015-01-29 at 12:48 +0100, Michal Kazior wrote:
Hi,
I'm not subscribed to netdev list and I can't find the message-id so I
can't reply directly to the original thread `BW regression after "tcp:
refine TSO autosizing"`.
I've noticed a big TCP performance drop with ath10k
(drivers/net/wireless/ath/ath10k) on 3.19-rc5. Instead of 500mbps I
get 250mbps in my testbed.
After bisecting I ended up at `tcp: refine TSO autosizing`. Reverting
`tcp: refine TSO autosizing` and `tcp: Do not apply TSO segment limit
to non-TSO packets` (for conflict free reverts) fixes the problem.
My testing setup is as follows:
a) ath10k AP, github.com/kvalo/ath/tree/master 3.19-rc5, w/ reverts
b) ath10k STA connected to (a), github.com/kvalo/ath/tree/master
3.19-rc5, w/ reverts
c) (b) w/o reverts
Devices are 3x3 (AP) and 2x2 (Client) and are RF cabled. 11ac@80MHz
2x2 has 866mbps modulation rate. In practice this should deliver
~700mbps of real UDP traffic.
Here are some numbers:
UDP: (b) -> (a): 672mbps
UDP: (a) -> (b): 687mbps
TCP: (b) -> (a): 526mbps
TCP: (a) -> (b): 500mbps
UDP: (c) -> (a): 669mbps*
UDP: (a) -> (c): 689mbps*
TCP: (c) -> (a): 240mbps**
TCP: (a) -> (c): 490mbps*
* no changes/within error margin
** the performance drop
I'm using iperf:
UDP: iperf -i1 -s -u vs iperf -i1 -c XX -u -B 200M -P5 -t 20
TCP: iperf -i1 -s vs iperf -i1 -c XX -P5 -t 20
Result values were obtained at the receiver side.
Iperf reports a few frames lost and out-of-order at each UDP test
start (during first second) but later has no packet loss and no
out-of-order. This shouldn't have any effect on a TCP session, right?
The device delivers batched up tx/rx completions (no way to change
that). I suppose this could be an issue for timing sensitive
algorithms. Also keep in mind 802.11n and 802.11ac devices have frame
aggregation windows so there's an inherent extra (and non-uniform)
latency when compared to, e.g. ethernet devices.
The driver doesn't have GRO. I have an internal patch which implements
it. It improves overall TCP traffic (more stable, up to 600mbps TCP
which is ~100mbps more than without GRO) but the TCP: (c) -> (a)
performance drop remains unaffected regardless.
I've tried applying stretch ACK patchset (v2) on both machines and
re-run the above tests. I got no measurable difference in performance.
I've also run these tests with iwlwifi 7260 (also a 2x2) as (b) and
(c). It didn't seem to be affected by the TSO patch at all (it runs at
~360mbps of TCP regardless of the TSO patch).
Any hints/ideas?
Hi Michal
This patch restored original TSQ behavior, because the 1ms worth of data
per flow had totally destroyed TSQ intent.
vi +630 Documentation/networking/ip-sysctl.txt
tcp_limit_output_bytes - INTEGER
Controls TCP Small Queue limit per tcp socket.
TCP bulk sender tends to increase packets in flight until it
gets losses notifications. With SNDBUF autotuning, this can
result in a large amount of packets queued in qdisc/device
on the local machine, hurting latency of other flows, for
typical pfifo_fast qdiscs.
tcp_limit_output_bytes limits the number of bytes on qdisc
or device to reduce artificial RTT/cwnd and reduce bufferbloat.
Default: 131072
This is why I suggested to Eyal Perry to change the TX interrupt
mitigation parameters as in :
ethtool -C eth0 tx-frames 4 rx-frames 4
With this change and the stretch ack fixes, I got 37Gbps of throughput
on a single flow, on a 40Gbit NIC (mlx4)
If a driver needs to buffer more than tcp_limit_output_bytes=131072 to
get line rate, I suggest that you either :
1) tweak tcp_limit_output_bytes, but its not practical from a driver.
2) change the driver, knowing what are its exact requirements, by
removing a fraction of skb->truesize at ndo_start_xmit() time as in :
if ((skb->destructor == sock_wfree ||
skb->restuctor == tcp_wfree)&&
skb->sk) {
u32 fraction = skb->truesize / 2;
skb->truesize -= fraction;
atomic_sub(fraction,&skb->sk->sk_wmem_alloc);
}
Hi Eric,
Your suggestions are still based on the fact that you consider wireless
networking to be similar to ethernet, but as Michal indicated there are
some fundamental differences starting with CSMA/CD versus CSMA/CA. Also
the medium conditions are far from comparable. There is no shielding so
it needs to deal with interference and dynamically drops the link rate
so transmission of packets can take several milliseconds. Then with 11n
they came up with aggregation with sends up to 64 packets in a single
transmit over the air at worst case 6.5 Mbps (if I am not mistaken). The
parameter value for tcp_limit_output_bytes of 131072 means that it
allows queuing for about 1ms on a 1Gbps link, but I hope you can see
this is not realistic for dealing with all variances of the wireless
medium/standard. I suggested this as topic for the wireless workshop in
Otawa [1], but I can not attend there. Still hope that there will be
some discussions to get more awareness.
Regards,
Arend
[1] http://mid.gmane.org/54BE9791.1070706@xxxxxxxxxxxx
--
To unsubscribe from this list: send the line "unsubscribe linux-wireless" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html