On Tue, Feb 15, 2022 at 2:37 AM <kerneljasonxing@xxxxxxxxx> wrote:
>
> From: Jason Xing <xingwanli@xxxxxxxxxxxx>
>
> Normally, users do not care about the logic inside the kernel when
> they set the receive buffer via setsockopt(). However, if the new
> value of the receive buffer is not smaller than the initial value,
> which is sysctl_tcp_rmem[1] as used in tcp_rcv_space_adjust(), the
> server's wscale shrinks, which leads to bad bandwidth. I think this
> is not appropriate.

Then do not use SO_RCVBUF ? It is working as intended really.

>
> Here are some numbers:
> $ sysctl -a | grep rmem
> net.core.rmem_default = 212992
> net.core.rmem_max = 40880000
> net.ipv4.tcp_rmem = 4096 425984 40880000
>
> Case 1
> On the server side:
> # iperf -s -p 5201
> On the client side:
> # iperf -c [server ip] -p 5201
> It turns out that the bandwidth is 9.34 Gbits/sec while the wscale of
> the server side is 10. This is good.
>
> Case 2
> On the server side:
> # iperf -s -p 5201 -w 425984
> On the client side:
> # iperf -c [server ip] -p 5201
> The bandwidth drops to 2.73 Gbits/sec while the wscale is 2, even
> though the requested receive buffer equals the initial value, i.e.
> nothing was actually changed.

Great, you discovered auto tuning is working as intended.

>
> Therefore, I added a condition so that SOCK_RCVBUF_LOCK is only set
> when the user is shrinking the receive buffer. After this patch is
> applied, the bandwidth of case 2 recovers to 9.34 Gbits/sec.
>
> Fixes: e88c64f0a425 ("tcp: allow effective reduction of TCP's rcv-buffer via setsockopt")

This commit has nothing to do with your patch or feature.

> Signed-off-by: Jason Xing <xingwanli@xxxxxxxxxxxx>
> ---
>  net/core/filter.c | 7 ++++---
>  net/core/sock.c   | 8 +++++---
>  2 files changed, 9 insertions(+), 6 deletions(-)
>
> diff --git a/net/core/filter.c b/net/core/filter.c
> index 4603b7c..99f5d9c 100644
> --- a/net/core/filter.c
> +++ b/net/core/filter.c
> @@ -4795,9 +4795,10 @@ static int _bpf_setsockopt(struct sock *sk, int level, int optname,
>  	case SO_RCVBUF:
>  		val = min_t(u32, val, sysctl_rmem_max);
>  		val = min_t(int, val, INT_MAX / 2);
> -		sk->sk_userlocks |= SOCK_RCVBUF_LOCK;
> -		WRITE_ONCE(sk->sk_rcvbuf,
> -			   max_t(int, val * 2, SOCK_MIN_RCVBUF));
> +		val = max_t(int, val * 2, SOCK_MIN_RCVBUF);
> +		if (val < sock_net(sk)->ipv4.sysctl_tcp_rmem[1])
> +			sk->sk_userlocks |= SOCK_RCVBUF_LOCK;
> +		WRITE_ONCE(sk->sk_rcvbuf, val);
>  		break;
>  	case SO_SNDBUF:
>  		val = min_t(u32, val, sysctl_wmem_max);
> diff --git a/net/core/sock.c b/net/core/sock.c
> index 4ff806d..e5e9cb0 100644
> --- a/net/core/sock.c
> +++ b/net/core/sock.c
> @@ -923,8 +923,6 @@ static void __sock_set_rcvbuf(struct sock *sk, int val)
>  	 * as a negative value.
>  	 */
>  	val = min_t(int, val, INT_MAX / 2);
> -	sk->sk_userlocks |= SOCK_RCVBUF_LOCK;
> -
>  	/* We double it on the way in to account for "struct sk_buff" etc.
>  	 * overhead. Applications assume that the SO_RCVBUF setting they make
>  	 * will allow that much actual data to be received on that socket.
> @@ -935,7 +933,11 @@ static void __sock_set_rcvbuf(struct sock *sk, int val)
>  	 * And after considering the possible alternatives, returning the value
>  	 * we actually used in getsockopt is the most desirable behavior.
>  	 */
> -	WRITE_ONCE(sk->sk_rcvbuf, max_t(int, val * 2, SOCK_MIN_RCVBUF));
> +	val = max_t(int, val * 2, SOCK_MIN_RCVBUF);
> +	if (val < sock_net(sk)->ipv4.sysctl_tcp_rmem[1])
> +		sk->sk_userlocks |= SOCK_RCVBUF_LOCK;
> +
> +	WRITE_ONCE(sk->sk_rcvbuf, val);
>  }
>
>  void sock_set_rcvbuf(struct sock *sk, int val)
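
To make the behavior concrete from userspace, here is a small standalone
demo (not part of the patch; the numbers assume the sysctl values quoted
above): a single SO_RCVBUF call both doubles the value on the way in, as
the comment in __sock_set_rcvbuf() says, and sets SOCK_RCVBUF_LOCK,
which is exactly what keeps auto tuning from growing the buffer in case 2.

#include <stdio.h>
#include <sys/socket.h>

int main(void)
{
	int fd = socket(AF_INET, SOCK_STREAM, 0);
	int val = 425984;	/* the value iperf -w 425984 requests */
	socklen_t len = sizeof(val);

	if (fd < 0)
		return 1;

	/* From here on SOCK_RCVBUF_LOCK is set on the socket, so
	 * tcp_rcv_space_adjust() will never grow sk_rcvbuf again.
	 */
	setsockopt(fd, SOL_SOCKET, SO_RCVBUF, &val, sizeof(val));

	/* The kernel clamps the request to net.core.rmem_max and then
	 * stores val * 2 to account for struct sk_buff overhead; with
	 * rmem_max = 40880000 as above, this prints 851968.
	 */
	getsockopt(fd, SOL_SOCKET, SO_RCVBUF, &val, &len);
	printf("SO_RCVBUF is now %d\n", val);
	return 0;
}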

You are breaking applications that want to set sk->sk_rcvbuf to a fixed
value, to control memory usage on millions of active sockets in a host.

I think that you want new functionality, with new SO_ socket options,
targeting the net-next tree (no spurious Fixes: tag).

For instance, letting an application set or unset SOCK_RCVBUF_LOCK would
be more useful, and would not break applications, as sketched below.
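
A minimal sketch of that idea, assuming a hypothetical SO_RCVBUF_LOCKED
option (the name is made up, not an existing uapi constant); the fragment
would sit in the optname switch of sock_setsockopt(), where the socket
lock is already held:

	/* Hypothetical: let the application toggle SOCK_RCVBUF_LOCK
	 * directly, instead of redefining when SO_RCVBUF implies it.
	 */
	case SO_RCVBUF_LOCKED:
		if (val)
			sk->sk_userlocks |= SOCK_RCVBUF_LOCK;	/* pin sk_rcvbuf */
		else
			sk->sk_userlocks &= ~SOCK_RCVBUF_LOCK;	/* resume auto tuning */
		break;

Existing SO_RCVBUF users would keep their fixed buffers, while an
application that wants auto tuning back after an SO_RCVBUF call could
opt in explicitly.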