On Wed, Jul 27, 2022 at 1:19 PM Marek Majkowski <marek@xxxxxxxxxxxxxx> wrote: > > On Fri, Jul 22, 2022 at 11:23 AM Eric Dumazet <edumazet@xxxxxxxxxx> wrote: > > > > On Thu, Jul 21, 2022 at 5:10 PM Marek Majkowski <marek@xxxxxxxxxxxxxx> wrote: > > > > > > We already support RTAX_INITRWND / initrwnd path attribute: > > > > > > $ ip route change local 127.0.0.0/8 dev lo initrwnd 1024 > > > > > > However normally, the initial advertised receive window is limited to > > > 64KiB by rcv_ssthresh, regardless of initrwnd. This patch changes > > > that, bumping up rcv_ssthresh to value derived from initrwnd. This > > > allows for larger initial advertised receive windows, which is useful > > > for specific types of TCP flows: big BDP ones, where there is a lot of > > > data to send immediately after the flow is established. > > > > > > There are three places where we initialize sockets: > > > - tcp_output:tcp_connect_init > > > - tcp_minisocks:tcp_openreq_init_rwin > > > - syncookies > > > > > > In the first two we already have a call to `tcp_rwnd_init_bpf` and > > > `dst_metric(RTAX_INITRWND)` which retrieve the bpf/path initrwnd > > > attribute. We use this value to bring `rcv_ssthresh` up, potentially > > > above the traditional 64KiB. > > > > > > With higher initial `rcv_ssthresh` the receiver will open the receive > > > window more aggresively, which can improve large BDP flows - large > > > throughput and latency. > > > > > > This patch does not cover the syncookies case. > > > > > > Signed-off-by: Marek Majkowski <marek@xxxxxxxxxxxxxx> > > > --- > > > include/net/inet_sock.h | 1 + > > > net/ipv4/tcp_minisocks.c | 8 ++++++-- > > > net/ipv4/tcp_output.c | 10 ++++++++-- > > > 3 files changed, 15 insertions(+), 4 deletions(-) > > > > > > diff --git a/include/net/inet_sock.h b/include/net/inet_sock.h > > > index daead5fb389a..bc68c9b70942 100644 > > > --- a/include/net/inet_sock.h > > > +++ b/include/net/inet_sock.h > > > @@ -89,6 +89,7 @@ struct inet_request_sock { > > > no_srccheck: 1, > > > smc_ok : 1; > > > u32 ir_mark; > > > + u32 rcv_ssthresh; Please move this in struct tcp_request_sock > > > > Why do we need to store this value in the request_sock ? > > > > It is derived from a route attribute and MSS, all this should be > > available when the full blown socket is created. > > > > It would also work even with syncookies. > > Eric, > > Thanks for the feedback. For some context, I published a blog post > explaining this work in detail [1]. > > https://blog.cloudflare.com/when-the-window-is-not-fully-open-your-tcp-stack-is-doing-more-than-you-think/ > > I understand the suggestion is to move tcp_rwnd_init_bpf + > RTAX_INITRWND lookup from `tcp_openreq_init_rwin` into > `tcp_create_openreq_child`. > > I gave it a try (patch: [2]), but I don't think this will work under > all circumstances. The issue is that we need to advertise *some* > window in the SYNACK packet, before creating the full blown socket. > > With RTAX_INITRWND it is possible to move the advertised window up, or > down. > > In the latter case, of reducing the window, at the SYNACK moment we > must know if the window is reduced under 64KiB. This is what happens > right now, we can _reduce_ window with RTAX_INITRWND to small values, > I guess down to 1 MSS. This smaller window is then advertised in the > SYNACK. > > If we move RTAX_INITRWND lookup into the later > `tcp_create_openreq_child` then it will be too late, we won't know the > correct window size on SYNACK stage. We will likely end up sending > large window on SYNACK and then a small window on subsequent ACK, > violating TCP. > > There are two approaches here. First, keep the semantics and allow > RTAX_INITRWND to _reduce_ the initial window. > > In this case there are four ways out of this: > > 1) Keep it as proposed, that indeed requires some new value in > request_sock. (perhaps maybe it could be it smaller than u32) > > 2) Send the SYNACK with small/zero window, since we haven't done the > initrwnd lookup at this stage, but that would be at least > controversial, and also adds one more RTT to the common case. I don't > think this is acceptable. > > 3) Do two initrwnd lookups. One in the `tcp_openreq_init_rwin` to > figure out if the window is smaller than 64KiB, second one in > `tcp_create_openreq_child` to figure out if the suggested window is > larger than 64KiB. I think syncookies can be handled, if you look at cookie_v6_check() & cookie_v4_check() after their calls to cookie_tcp_reqsk_alloc() > > 4) Abort the whole approach and recycle Ivan's > bpf_setsockopt(TCP_BPF_RCV_SSTHRESH) approach [3]. But I prefer the route > attribute approach, seems easier to use and more flexible. > > But, thinking about it, I don't think we could ever support reducing > initial receive window in the syncookie case. Only (3) - two initrwnd > lookups - could be made to work, but even that is controversial. > > However the intention of RTAX_INITRWND as far as I understand was to > _increase_ rcv_ssthresh, back in the days when it started from 10MSS > (so I was told). That was before we fixed DRS and that we made initial RWIN 65535, the max allowed value in a SYN , SYNACK packet. But yes... > > So, we could change the semantics of RTAX_INITRWND to allow only > *increasing* the window - and disallow reducing it. With such an > approach indeed we could make the code as you suggested, and move the > route attribute lookup away from minisocks into `tcp_create_openreq_child`. > > Marek > > [1] https://blog.cloudflare.com/when-the-window-is-not-fully-open-your-tcp-stack-is-doing-more-than-you-think/ > [2] https://gist.github.com/majek/13848c050a3dc218ed295364ee717879 > [3] https://lore.kernel.org/bpf/20220111192952.49040-1-ivan@xxxxxxxxxxxxxx/t/