> -----Original Message----- > From: Kuniyuki Iwashima <kuniyu@xxxxxxxxxx> > Sent: Friday, June 9, 2023 3:54 PM > To: Haiyang Zhang <haiyangz@xxxxxxxxxxxxx> > Cc: ncardwell@xxxxxxxxxx; atenart@xxxxxxxxxx; bagasdotme@xxxxxxxxx; > corbet@xxxxxxx; davem@xxxxxxxxxxxxx; dsahern@xxxxxxxxxx; > edumazet@xxxxxxxxxx; kuba@xxxxxxxxxx; kuniyu@xxxxxxxxxx; KY > Srinivasan <kys@xxxxxxxxxxxxx>; linux-doc@xxxxxxxxxxxxxxx; linux- > hyperv@xxxxxxxxxxxxxxx; linux-kernel@xxxxxxxxxxxxxxx; > liushixin2@xxxxxxxxxx; maheshb@xxxxxxxxxx; netdev@xxxxxxxxxxxxxxx; > olaf@xxxxxxxxx; pabeni@xxxxxxxxxx; simon.horman@xxxxxxxxxxxx; > soheil@xxxxxxxxxx; stephen@xxxxxxxxxxxxxxxxxx; Tim Gardner > <tim.gardner@xxxxxxxxxxxxx>; vkuznets@xxxxxxxxxx; weiwan@xxxxxxxxxx; > ycheng@xxxxxxxxxx; ykaliuta@xxxxxxxxxx > Subject: Re: [PATCH net-next] tcp: Make pingpong threshold tunable > > [You don't often get email from kuniyu@xxxxxxxxxx. Learn why this is > important at https://aka.ms/LearnAboutSenderIdentification ] > > From: Neal Cardwell <ncardwell@xxxxxxxxxx> > Date: Fri, 9 Jun 2023 15:16:00 -0400 > > On Fri, Jun 9, 2023 at 12:26 PM Haiyang Zhang <haiyangz@xxxxxxxxxxxxx> > wrote: > > > > Regarding the patch title: > > > [PATCH net-next] tcp: Make pingpong threshold tunable > > > > There are many ways to make something tunable these days, including > > BPF, setsockopt(), and sysctl. :-) This patch only uses sysctl. Please > > consider a more clear/specific title, like: > > > > [PATCH net-next] tcp: set pingpong threshold via sysctl > > > > > TCP pingpong threshold is 1 by default. But some applications, like SQL DB > > > may prefer a higher pingpong threshold to activate delayed acks in quick > > > ack mode for better performance. > > > > > > The pingpong threshold and related code were changed to 3 in the year > > > 2019, and reverted to 1 in the year 2022. > > > > Please include the specific commit, like: > > > > The pingpong threshold and related code were changed to 3 in the year > > 2019 in: > > commit 4a41f453bedf ("tcp: change pingpong threshold to 3") > > and reverted to 1 in the year 2022 in: > > commit 4d8f24eeedc5 ("Revert "tcp: change pingpong threshold to 3"") > > > > Then please make sure to use scripts/checkpatch.pl on your resulting > > patch to check the formatting of the commit references, among other > > things. > > > > > There is no single value that > > > fits all applications. > > > > > > Add net.core.tcp_pingpong_thresh sysctl tunable, > > > > For consistency, TCP sysctls should be in net.ipv4 rather than > > net.core. Yes, that is awkward, given IPv6 support. But consistency is > > very important here. :-) > > > > > so it can be tuned for > > > optimal performance based on the application needs. > > > > > > Signed-off-by: Haiyang Zhang <haiyangz@xxxxxxxxxxxxx> > > > --- > > > Documentation/admin-guide/sysctl/net.rst | 8 ++++++++ > > > include/net/inet_connection_sock.h | 14 +++++++++++--- > > > net/core/sysctl_net_core.c | 9 +++++++++ > > > net/ipv4/tcp.c | 2 ++ > > > net/ipv4/tcp_output.c | 17 +++++++++++++++-- > > > 5 files changed, 45 insertions(+), 5 deletions(-) > > > > > > diff --git a/Documentation/admin-guide/sysctl/net.rst > b/Documentation/admin-guide/sysctl/net.rst > > > index 4877563241f3..16f54be9461f 100644 > > > --- a/Documentation/admin-guide/sysctl/net.rst > > > +++ b/Documentation/admin-guide/sysctl/net.rst > > > @@ -413,6 +413,14 @@ historical importance. > > > > > > Default: 0 > > > > > > +tcp_pingpong_thresh > > > +------------------- > > > + > > > +TCP pingpong threshold is 1 by default, but some application may need a > higher > > > +threshold for optimal performance. > > > + > > > +Default: 1, min: 1, max: 3 > > > > If we want to make this tunable, it seems sad to make the max 3. I'd > > suggest making the max 255, since we have 8 bits of space anyway in > > the inet_csk(sk)->icsk_ack.pingpong field. > > > > > + > > > 2. /proc/sys/net/unix - Parameters for Unix domain sockets > > > ---------------------------------------------------------- > > > > > > diff --git a/include/net/inet_connection_sock.h > b/include/net/inet_connection_sock.h > > > index c2b15f7e5516..e84e33ddae49 100644 > > > --- a/include/net/inet_connection_sock.h > > > +++ b/include/net/inet_connection_sock.h > > > @@ -324,11 +324,11 @@ void inet_csk_update_fastreuse(struct > inet_bind_bucket *tb, > > > > > > struct dst_entry *inet_csk_update_pmtu(struct sock *sk, u32 mtu); > > > > > > -#define TCP_PINGPONG_THRESH 1 > > > +extern int tcp_pingpong_thresh; > > > > To match most TCP sysctls, this should be per-namespace, rather than > global. > > Also, please change int to u8. > > > > > > Please follow a recent example by Eric, perhaps: > > 65466904b015f6eeb9225b51aeb29b01a1d4b59c > > tcp: adjust TSO packet sizes based on min_rtt > > > > > > > > > > static inline void inet_csk_enter_pingpong_mode(struct sock *sk) > > > { > > > - inet_csk(sk)->icsk_ack.pingpong = TCP_PINGPONG_THRESH; > > > + inet_csk(sk)->icsk_ack.pingpong = tcp_pingpong_thresh; > > > } > > > > inet_csk(sk)->icsk_ack.pingpong = sock_net(sk)- > >sysctl_tcp_pingpong_thresh; > > Let's use READ_ONCE(sock_net(sk)->sysctl_tcp_pingpong_thresh). > Same for other sysctl reads. > > > > > > > static inline void inet_csk_exit_pingpong_mode(struct sock *sk) > > > @@ -338,7 +338,15 @@ static inline void > inet_csk_exit_pingpong_mode(struct sock *sk) > > > > > > static inline bool inet_csk_in_pingpong_mode(struct sock *sk) > > > { > > > - return inet_csk(sk)->icsk_ack.pingpong >= TCP_PINGPONG_THRESH; > > > + return inet_csk(sk)->icsk_ack.pingpong >= tcp_pingpong_thresh; > > > +} > > > > Again, sock_net(sk)->sysctl_tcp_pingpong_thresh rather than > tcp_pingpong_thresh. > > > > > +static inline void inet_csk_inc_pingpong_cnt(struct sock *sk) > > > +{ > > > + struct inet_connection_sock *icsk = inet_csk(sk); > > > + > > > + if (icsk->icsk_ack.pingpong < U8_MAX) > > > + icsk->icsk_ack.pingpong++; > > > } > > > > > > static inline bool inet_csk_has_ulp(struct sock *sk) > > > diff --git a/net/core/sysctl_net_core.c b/net/core/sysctl_net_core.c > > > index 782273bb93c2..b5253567f2bd 100644 > > > --- a/net/core/sysctl_net_core.c > > > +++ b/net/core/sysctl_net_core.c > > > @@ -653,6 +653,15 @@ static struct ctl_table net_core_table[] = { > > > > Again, in net.ipv4, not net.core. > > > > > .proc_handler = proc_dointvec_minmax, > > > .extra1 = SYSCTL_ZERO, > > > }, > > > + { > > > + .procname = "tcp_pingpong_thresh", > > > + .data = &tcp_pingpong_thresh, > > > + .maxlen = sizeof(int), > > > + .mode = 0644, > > > + .proc_handler = proc_dointvec_minmax, > > > + .extra1 = SYSCTL_ONE, > > > + .extra2 = SYSCTL_THREE, > > > > Please make the max U8_MAX to allow more flexibility (since we have 8 > > bits of space anyway in the inet_csk(sk)->icsk_ack.pingpong field). > > Please use proc_dou8vec_minmax(), then you can drop .extra2. > > .maxlen = sizeof(u8), > .mode = 0644, > .proc_handler = proc_dou8vec_minmax, > .extra1 = SYSCTL_ONE, > > Thanks, > Kuniyuki > > > > > > + }, > > > { } > > > }; > > > > > > diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c > > > index 53b7751b68e1..dcd143193d41 100644 > > > --- a/net/ipv4/tcp.c > > > +++ b/net/ipv4/tcp.c > > > @@ -308,6 +308,8 @@ EXPORT_SYMBOL(tcp_have_smc); > > > struct percpu_counter tcp_sockets_allocated > ____cacheline_aligned_in_smp; > > > EXPORT_SYMBOL(tcp_sockets_allocated); > > > > > > +int tcp_pingpong_thresh __read_mostly = 1; > > > + > > > > Again, per-network-namespace. You will need to initialize the > > per-netns value in tcp_sk_init(). Again, see Eric's > > 65466904b015f6eeb9225b51aeb29b01a1d4b59c commit for an example. > > > > > * TCP splice context > > > */ > > > diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c > > > index cfe128b81a01..576d21621778 100644 > > > --- a/net/ipv4/tcp_output.c > > > +++ b/net/ipv4/tcp_output.c > > > @@ -167,12 +167,25 @@ static void tcp_event_data_sent(struct > tcp_sock *tp, > > > if (tcp_packets_in_flight(tp) == 0) > > > tcp_ca_event(sk, CA_EVENT_TX_START); > > > > > > + /* If tcp_pingpong_thresh > 1, and > > > + * this is the first data packet sent in response to the > > > + * previous received data, > > > + * and it is a reply for ato after last received packet, > > > + * increase pingpong count. > > > + */ > > > + if (tcp_pingpong_thresh > 1 && > > > + before(tp->lsndtime, icsk->icsk_ack.lrcvtime) && > > > + (u32)(now - icsk->icsk_ack.lrcvtime) < icsk->icsk_ack.ato) > > > + inet_csk_inc_pingpong_cnt(sk); > > > + > > > > Introducing this new code re-introduces a bug fixed in 4d8f24eeedc5. > > As that commit description noted: > > > > This to-be-reverted commit was meant to apply a stricter rule for the > > stack to enter pingpong mode. However, the condition used to check for > > interactive session "before(tp->lsndtime, icsk->icsk_ack.lrcvtime)" is > > jiffy based and might be too coarse, which delays the stack entering > > pingpong mode. > > We revert this patch so that we no longer use the above condition to > > determine interactive session, > > > > > tp->lsndtime = now; > > > > > > - /* If it is a reply for ato after last received > > > + /* If tcp_pingpong_thresh == 1, and > > > > Please remove the "If tcp_pingpong_thresh == 1, and" part, since this > > is the correct code path no matter the value of the threshold. > > > > > + * it is a reply for ato after last received > > > * packet, enter pingpong mode. > > > */ > > > - if ((u32)(now - icsk->icsk_ack.lrcvtime) < icsk->icsk_ack.ato) > > > + if (tcp_pingpong_thresh == 1 && > > > > Please remove the "if (tcp_pingpong_thresh == 1 &&" part, since this > > is the correct code path no matter the value of the threshold. > > > > > + (u32)(now - icsk->icsk_ack.lrcvtime) < icsk->icsk_ack.ato) > > > inet_csk_enter_pingpong_mode(sk); > > > > Please make this call inet_csk_inc_pingpong_cnt(), since this is the > > correct code path no matter the value of the threshold. Thank Neal and Kuniyuki for the detailed reviews! I will update the code accordingly. - Haiyang