Re: [PATCH net-next] tcp: optimise receiver buffer autotuning initialisation for high latency connections

Neal Cardwell <ncardwell@xxxxxxxxxx> · Fri, 4 Dec 2020 16:28:35 -0500

On Fri, Dec 4, 2020 at 1:08 PM Hazem Mohamed Abuelfotoh
<abuehaze@xxxxxxxxxx> wrote:
>
>     Previously receiver buffer auto-tuning starts after receiving
>     one advertised window amount of data.After the initial
>     receiver buffer was raised by
>     commit a337531b942b ("tcp: up initial rmem to 128KB
>     and SYN rwin to around 64KB"),the receiver buffer may
>     take too long for TCP autotuning to start raising
>     the receiver buffer size.
>     commit 041a14d26715 ("tcp: start receiver buffer autotuning sooner")
>     tried to decrease the threshold at which TCP auto-tuning starts
>     but it's doesn't work well in some environments
>     where the receiver has large MTU (9001) configured
>     specially within environments where RTT is high.
>     To address this issue this patch is relying on RCV_MSS
>     so auto-tuning can start early regardless
>     the receiver configured MTU.
>
>     Fixes: a337531b942b ("tcp: up initial rmem to 128KB and SYN rwin to around 64KB")
>     Fixes: 041a14d26715 ("tcp: start receiver buffer autotuning sooner")
>
> Signed-off-by: Hazem Mohamed Abuelfotoh <abuehaze@xxxxxxxxxx>
> ---
>  net/ipv4/tcp_input.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
>
> diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
> index 389d1b340248..f0ffac9e937b 100644
> --- a/net/ipv4/tcp_input.c
> +++ b/net/ipv4/tcp_input.c
> @@ -504,13 +504,14 @@ static void tcp_grow_window(struct sock *sk, const struct sk_buff *skb)
>  static void tcp_init_buffer_space(struct sock *sk)
>  {
>         int tcp_app_win = sock_net(sk)->ipv4.sysctl_tcp_app_win;
> +       struct inet_connection_sock *icsk = inet_csk(sk);
>         struct tcp_sock *tp = tcp_sk(sk);
>         int maxwin;
>
>         if (!(sk->sk_userlocks & SOCK_SNDBUF_LOCK))
>                 tcp_sndbuf_expand(sk);
>
> -       tp->rcvq_space.space = min_t(u32, tp->rcv_wnd, TCP_INIT_CWND * tp->advmss);
> +       tp->rcvq_space.space = min_t(u32, tp->rcv_wnd, TCP_INIT_CWND * icsk->icsk_ack.rcv_mss);

Thanks for the detailed report and the proposed fix.

AFAICT the core of the bug is related to this part of your follow-up email:

> 10)It looks like when the issue happen we have a  kind of deadlock here
> so advertised receive window has to exceed rcvq_space for the tcp auto
> tuning to kickoff at the same time with the initial default  configuration the
> receive window is not going to exceed rcvq_space

The existing code is:

> -       tp->rcvq_space.space = min_t(u32, tp->rcv_wnd, TCP_INIT_CWND * tp->advmss);

(1) With a typical case where both sides have an advmss based on an
MTU of 1500 bytes, the governing limit here is around 10*1460 or so,
or around 15Kbytes. With a tp->rcvq_space.space of that size, it is
common for the data sender to send that amount of data in one round
trip, which will trigger the receive buffer autotuning code in
tcp_rcv_space_adjust(), so autotuning works well.

(2) With a case where the sender has a 1500B NIC and the receiver has
an advmss based on an MTU of 9KB, then the  expression becomes
governed by the receive window of 64KBytes:

  tp->rcvq_space.space
    ~=  min(rcv_wnd, 10*9KBytes)
    ~= min(64KByte,90KByte) = 65536 bytes

But because tp->rcvq_space.space is set to the rcv_wnd, and
the sender is not allowed to send more than the receive
window, this check in tcp_rcv_space_adjust() will always fail,
and the receiver will take the new_measure path:

        /* Number of bytes copied to user in last RTT */
        copied = tp->copied_seq - tp->rcvq_space.seq;
        if (copied <= tp->rcvq_space.space)
                goto new_measure;

This seems to be why receive buffer autotuning is never triggered.

Furthermore, if we try to fix it with:

-        if (copied <= tp->rcvq_space.space)
+        if (copied < tp->rcvq_space.space)

The buggy behavior will still remain, because 65536 is not a multiple
of the MSS, so the number of bytes copied to the user in the last RTT
will never, in practice, exactly match the tp->rcvq_space.space.

AFAICT the proposed patch fixes this bug by setting the
tp->rcvq_space.space to 10*536 = 5360. This is a number of bytes that
*can* actually be delivered in a round trip in a mixed 1500B/9KB
scenario, so this allows receive window auto-tuning to actually be
triggered.

It seems like a reasonable approach to fix this issue, but I agree
with Eric that it would be good to improve the commit message a bit.

Also, since this is a bug fix, it seems it should be directed to the
"net" branch rather than the "net-next" branch.

Also, FWIW, I think the "for high latency connections" can be dropped
from the commit summary/first-line, since the bug can be hit at any
RTT, and is simply easier to notice at high RTTs. You might consider
something like:

  tcp: fix receive buffer autotuning to trigger for any valid advertised MSS

best,
neal