From: Neal Cardwell <ncardwell@xxxxxxxxxx> Date: Fri, 9 Jun 2023 15:16:00 -0400 > On Fri, Jun 9, 2023 at 12:26 PM Haiyang Zhang <haiyangz@xxxxxxxxxxxxx> wrote: > > Regarding the patch title: > > [PATCH net-next] tcp: Make pingpong threshold tunable > > There are many ways to make something tunable these days, including > BPF, setsockopt(), and sysctl. :-) This patch only uses sysctl. Please > consider a more clear/specific title, like: > > [PATCH net-next] tcp: set pingpong threshold via sysctl > > > TCP pingpong threshold is 1 by default. But some applications, like SQL DB > > may prefer a higher pingpong threshold to activate delayed acks in quick > > ack mode for better performance. > > > > The pingpong threshold and related code were changed to 3 in the year > > 2019, and reverted to 1 in the year 2022. > > Please include the specific commit, like: > > The pingpong threshold and related code were changed to 3 in the year > 2019 in: > commit 4a41f453bedf ("tcp: change pingpong threshold to 3") > and reverted to 1 in the year 2022 in: > commit 4d8f24eeedc5 ("Revert "tcp: change pingpong threshold to 3"") > > Then please make sure to use scripts/checkpatch.pl on your resulting > patch to check the formatting of the commit references, among other > things. > > > There is no single value that > > fits all applications. > > > > Add net.core.tcp_pingpong_thresh sysctl tunable, > > For consistency, TCP sysctls should be in net.ipv4 rather than > net.core. Yes, that is awkward, given IPv6 support. But consistency is > very important here. :-) > > > so it can be tuned for > > optimal performance based on the application needs. > > > > Signed-off-by: Haiyang Zhang <haiyangz@xxxxxxxxxxxxx> > > --- > > Documentation/admin-guide/sysctl/net.rst | 8 ++++++++ > > include/net/inet_connection_sock.h | 14 +++++++++++--- > > net/core/sysctl_net_core.c | 9 +++++++++ > > net/ipv4/tcp.c | 2 ++ > > net/ipv4/tcp_output.c | 17 +++++++++++++++-- > > 5 files changed, 45 insertions(+), 5 deletions(-) > > > > diff --git a/Documentation/admin-guide/sysctl/net.rst b/Documentation/admin-guide/sysctl/net.rst > > index 4877563241f3..16f54be9461f 100644 > > --- a/Documentation/admin-guide/sysctl/net.rst > > +++ b/Documentation/admin-guide/sysctl/net.rst > > @@ -413,6 +413,14 @@ historical importance. > > > > Default: 0 > > > > +tcp_pingpong_thresh > > +------------------- > > + > > +TCP pingpong threshold is 1 by default, but some application may need a higher > > +threshold for optimal performance. > > + > > +Default: 1, min: 1, max: 3 > > If we want to make this tunable, it seems sad to make the max 3. I'd > suggest making the max 255, since we have 8 bits of space anyway in > the inet_csk(sk)->icsk_ack.pingpong field. > > > + > > 2. /proc/sys/net/unix - Parameters for Unix domain sockets > > ---------------------------------------------------------- > > > > diff --git a/include/net/inet_connection_sock.h b/include/net/inet_connection_sock.h > > index c2b15f7e5516..e84e33ddae49 100644 > > --- a/include/net/inet_connection_sock.h > > +++ b/include/net/inet_connection_sock.h > > @@ -324,11 +324,11 @@ void inet_csk_update_fastreuse(struct inet_bind_bucket *tb, > > > > struct dst_entry *inet_csk_update_pmtu(struct sock *sk, u32 mtu); > > > > -#define TCP_PINGPONG_THRESH 1 > > +extern int tcp_pingpong_thresh; > > To match most TCP sysctls, this should be per-namespace, rather than global. Also, please change int to u8. > > Please follow a recent example by Eric, perhaps: > 65466904b015f6eeb9225b51aeb29b01a1d4b59c > tcp: adjust TSO packet sizes based on min_rtt > > > > > > static inline void inet_csk_enter_pingpong_mode(struct sock *sk) > > { > > - inet_csk(sk)->icsk_ack.pingpong = TCP_PINGPONG_THRESH; > > + inet_csk(sk)->icsk_ack.pingpong = tcp_pingpong_thresh; > > } > > inet_csk(sk)->icsk_ack.pingpong = sock_net(sk)->sysctl_tcp_pingpong_thresh; Let's use READ_ONCE(sock_net(sk)->sysctl_tcp_pingpong_thresh). Same for other sysctl reads. > > > static inline void inet_csk_exit_pingpong_mode(struct sock *sk) > > @@ -338,7 +338,15 @@ static inline void inet_csk_exit_pingpong_mode(struct sock *sk) > > > > static inline bool inet_csk_in_pingpong_mode(struct sock *sk) > > { > > - return inet_csk(sk)->icsk_ack.pingpong >= TCP_PINGPONG_THRESH; > > + return inet_csk(sk)->icsk_ack.pingpong >= tcp_pingpong_thresh; > > +} > > Again, sock_net(sk)->sysctl_tcp_pingpong_thresh rather than tcp_pingpong_thresh. > > > +static inline void inet_csk_inc_pingpong_cnt(struct sock *sk) > > +{ > > + struct inet_connection_sock *icsk = inet_csk(sk); > > + > > + if (icsk->icsk_ack.pingpong < U8_MAX) > > + icsk->icsk_ack.pingpong++; > > } > > > > static inline bool inet_csk_has_ulp(struct sock *sk) > > diff --git a/net/core/sysctl_net_core.c b/net/core/sysctl_net_core.c > > index 782273bb93c2..b5253567f2bd 100644 > > --- a/net/core/sysctl_net_core.c > > +++ b/net/core/sysctl_net_core.c > > @@ -653,6 +653,15 @@ static struct ctl_table net_core_table[] = { > > Again, in net.ipv4, not net.core. > > > .proc_handler = proc_dointvec_minmax, > > .extra1 = SYSCTL_ZERO, > > }, > > + { > > + .procname = "tcp_pingpong_thresh", > > + .data = &tcp_pingpong_thresh, > > + .maxlen = sizeof(int), > > + .mode = 0644, > > + .proc_handler = proc_dointvec_minmax, > > + .extra1 = SYSCTL_ONE, > > + .extra2 = SYSCTL_THREE, > > Please make the max U8_MAX to allow more flexibility (since we have 8 > bits of space anyway in the inet_csk(sk)->icsk_ack.pingpong field). Please use proc_dou8vec_minmax(), then you can drop .extra2. .maxlen = sizeof(u8), .mode = 0644, .proc_handler = proc_dou8vec_minmax, .extra1 = SYSCTL_ONE, Thanks, Kuniyuki > > > + }, > > { } > > }; > > > > diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c > > index 53b7751b68e1..dcd143193d41 100644 > > --- a/net/ipv4/tcp.c > > +++ b/net/ipv4/tcp.c > > @@ -308,6 +308,8 @@ EXPORT_SYMBOL(tcp_have_smc); > > struct percpu_counter tcp_sockets_allocated ____cacheline_aligned_in_smp; > > EXPORT_SYMBOL(tcp_sockets_allocated); > > > > +int tcp_pingpong_thresh __read_mostly = 1; > > + > > Again, per-network-namespace. You will need to initialize the > per-netns value in tcp_sk_init(). Again, see Eric's > 65466904b015f6eeb9225b51aeb29b01a1d4b59c commit for an example. > > > * TCP splice context > > */ > > diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c > > index cfe128b81a01..576d21621778 100644 > > --- a/net/ipv4/tcp_output.c > > +++ b/net/ipv4/tcp_output.c > > @@ -167,12 +167,25 @@ static void tcp_event_data_sent(struct tcp_sock *tp, > > if (tcp_packets_in_flight(tp) == 0) > > tcp_ca_event(sk, CA_EVENT_TX_START); > > > > + /* If tcp_pingpong_thresh > 1, and > > + * this is the first data packet sent in response to the > > + * previous received data, > > + * and it is a reply for ato after last received packet, > > + * increase pingpong count. > > + */ > > + if (tcp_pingpong_thresh > 1 && > > + before(tp->lsndtime, icsk->icsk_ack.lrcvtime) && > > + (u32)(now - icsk->icsk_ack.lrcvtime) < icsk->icsk_ack.ato) > > + inet_csk_inc_pingpong_cnt(sk); > > + > > Introducing this new code re-introduces a bug fixed in 4d8f24eeedc5. > As that commit description noted: > > This to-be-reverted commit was meant to apply a stricter rule for the > stack to enter pingpong mode. However, the condition used to check for > interactive session "before(tp->lsndtime, icsk->icsk_ack.lrcvtime)" is > jiffy based and might be too coarse, which delays the stack entering > pingpong mode. > We revert this patch so that we no longer use the above condition to > determine interactive session, > > > tp->lsndtime = now; > > > > - /* If it is a reply for ato after last received > > + /* If tcp_pingpong_thresh == 1, and > > Please remove the "If tcp_pingpong_thresh == 1, and" part, since this > is the correct code path no matter the value of the threshold. > > > + * it is a reply for ato after last received > > * packet, enter pingpong mode. > > */ > > - if ((u32)(now - icsk->icsk_ack.lrcvtime) < icsk->icsk_ack.ato) > > + if (tcp_pingpong_thresh == 1 && > > Please remove the "if (tcp_pingpong_thresh == 1 &&" part, since this > is the correct code path no matter the value of the threshold. > > > + (u32)(now - icsk->icsk_ack.lrcvtime) < icsk->icsk_ack.ato) > > inet_csk_enter_pingpong_mode(sk); > > Please make this call inet_csk_inc_pingpong_cnt(), since this is the > correct code path no matter the value of the threshold.