Patch "udp: Avoid call to compute_score on multiple sites" has been added to the 6.6-stable tree

Sasha Levin <sashal@xxxxxxxxxx> · Sun, 26 May 2024 15:36:02 -0400

This is a note to let you know that I've just added the patch titled

    udp: Avoid call to compute_score on multiple sites

to the 6.6-stable tree which can be found at:
    http://www.kernel.org/git/?p=linux/kernel/git/stable/stable-queue.git;a=summary

The filename of the patch is:
     udp-avoid-call-to-compute_score-on-multiple-sites.patch
and it can be found in the queue-6.6 subdirectory.

If you, or anyone else, feels it should not be added to the stable tree,
please let <stable@xxxxxxxxxxxxxxx> know about it.



commit 8eb90392db01ad6734fb4bc36126caff28a99139
Author: Gabriel Krisman Bertazi <krisman@xxxxxxx>
Date:   Fri Apr 12 17:20:04 2024 -0400

    udp: Avoid call to compute_score on multiple sites
    
    [ Upstream commit 50aee97d15113b95a68848db1f0cb2a6c09f753a ]
    
    We've observed a 7-12% performance regression in iperf3 UDP ipv4 and
    ipv6 tests with multiple sockets on Zen3 cpus, which we traced back to
    commit f0ea27e7bfe1 ("udp: re-score reuseport groups when connected
    sockets are present").  The failing tests were those that would spawn
    UDP sockets per-cpu on systems that have a high number of cpus.
    
    Unsurprisingly, it is not caused by the extra re-scoring of the reused
    socket, but by the compiler no longer inlining compute_score once it
    has the extra call site in udp4_lib_lookup2.  The cost of the resulting
    call is amplified by the "Safe RET" mitigation for SRSO, needed on our
    Zen3 cpus.
    
    We could just explicitly inline it, but compute_score() is quite a large
    function, around 300b.  Inlining in two sites would almost double
    udp4_lib_lookup2, which is a silly thing to do just to work around a
    mitigation.  Instead, this patch shuffles the code a bit to avoid the
    multiple calls to compute_score.  Since it is a static function used in
    one spot, the compiler can safely fold it in, as it did before, without
    increasing the text size.
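
The control flow described above can be sketched outside the kernel. The
following is a minimal, self-contained illustration and not the kernel
code: struct toy_sock, toy_score(), toy_lookup() and call_count are
hypothetical stand-ins, and the "redirect" field only imitates a reuseport
selection. It shows how a backwards jump keeps a single call site for the
scoring function while still scoring the selected socket a second time.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for struct sock; not the kernel type. */
struct toy_sock {
	int weight;                /* pretend match quality */
	struct toy_sock *redirect; /* pretend reuseport pick, may be NULL */
};

/* Counts scoring calls so the behaviour can be checked. */
static int call_count;

/* Hypothetical scoring function with exactly one caller below. */
static int toy_score(const struct toy_sock *sk)
{
	call_count++;
	return sk->weight;
}

static struct toy_sock *toy_lookup(struct toy_sock *socks, size_t n)
{
	struct toy_sock *sk, *result = NULL;
	int score, badness = 0;
	bool need_rescore;
	size_t i;

	for (i = 0; i < n; i++) {
		sk = &socks[i];
		need_rescore = false;
rescore:
		/* The only call site: scores either sk or the new result. */
		score = toy_score(need_rescore ? result : sk);
		if (score > badness) {
			badness = score;

			/* Second pass: badness is updated, nothing else to do. */
			if (need_rescore)
				continue;

			if (!sk->redirect) {
				result = sk;
				continue;
			}

			/* Jump backwards instead of calling toy_score() here. */
			result = sk->redirect;
			need_rescore = true;
			goto rescore;
		}
	}
	return result;
}
```

With one caller, a compiler is free to fold toy_score() into toy_lookup()
without duplicating its body; whether it actually does so depends on the
compiler and its options.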
    
    With this patch applied I ran my original iperf3 testcases.  The failing
    cases all looked like this (ipv4):
            iperf3 -c 127.0.0.1 --udp -4 -f K -b $R -l 8920 -t 30 -i 5 -P 64 -O 2
    
    where $R is one of 1G, 10G or 0 (max, unlimited).  I ran each case 3
    times.  Baseline is v6.9-rc3.  harmean == harmonic mean; CV ==
    coefficient of variation.
    
    ipv4:
                     1G                10G                  MAX
                HARMEAN  (CV)      HARMEAN  (CV)    HARMEAN     (CV)
    baseline 1743852.66(0.0208) 1725933.02(0.0167) 1705203.78(0.0386)
    patched  1968727.61(0.0035) 1962283.22(0.0195) 1923853.50(0.0256)
    
    ipv6:
                     1G                10G                  MAX
                HARMEAN  (CV)      HARMEAN  (CV)    HARMEAN     (CV)
    baseline 1729020.03(0.0028) 1691704.49(0.0243) 1692251.34(0.0083)
    patched  1900422.19(0.0067) 1900968.01(0.0067) 1568532.72(0.1519)
    
    This restores the performance we had before the change above with this
    benchmark.  We obviously don't expect any real impact when mitigations
    are disabled, but just to be sure it also doesn't regress:
    
    mitigations=off ipv4:
                     1G                10G                  MAX
                HARMEAN  (CV)      HARMEAN  (CV)    HARMEAN     (CV)
    baseline 3230279.97(0.0066) 3229320.91(0.0060) 2605693.19(0.0697)
    patched  3242802.36(0.0073) 3239310.71(0.0035) 2502427.19(0.0882)
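
For readers who want to reproduce the summary columns, here is a minimal
sketch of the harmonic mean named in the tables, plus a relative-change
helper for comparing two rows. Both function names are hypothetical, and
the commit does not say which tool produced the figures. Applied to the
ipv4 1G row with mitigations enabled, relative_change(1743852.66,
1968727.61) comes out at roughly 0.129, i.e. about a 12.9% gain.

```c
#include <stddef.h>

/* Harmonic mean of n positive samples: n / sum(1 / x[i]). */
static double harmonic_mean(const double *x, size_t n)
{
	double inv_sum = 0.0;
	size_t i;

	for (i = 0; i < n; i++)
		inv_sum += 1.0 / x[i];
	return (double)n / inv_sum;
}

/* Relative change from baseline to patched, as a fraction. */
static double relative_change(double baseline, double patched)
{
	return (patched - baseline) / baseline;
}
```

The harmonic mean is a common choice for averaging rates, since it is
dominated by the slower runs rather than the faster ones.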
    
    Cc: Lorenz Bauer <lmb@xxxxxxxxxxxxx>
    Fixes: f0ea27e7bfe1 ("udp: re-score reuseport groups when connected sockets are present")
    Signed-off-by: Gabriel Krisman Bertazi <krisman@xxxxxxx>
    Reviewed-by: Kuniyuki Iwashima <kuniyu@xxxxxxxxxx>
    Reviewed-by: Willem de Bruijn <willemb@xxxxxxxxxx>
    Signed-off-by: David S. Miller <davem@xxxxxxxxxxxxx>
    Signed-off-by: Sasha Levin <sashal@xxxxxxxxxx>

diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index ca576587f6d21..16ca211c8619d 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -429,15 +429,21 @@ static struct sock *udp4_lib_lookup2(struct net *net,
 {
 	struct sock *sk, *result;
 	int score, badness;
+	bool need_rescore;
 
 	result = NULL;
 	badness = 0;
 	udp_portaddr_for_each_entry_rcu(sk, &hslot2->head) {
-		score = compute_score(sk, net, saddr, sport,
-				      daddr, hnum, dif, sdif);
+		need_rescore = false;
+rescore:
+		score = compute_score(need_rescore ? result : sk, net, saddr,
+				      sport, daddr, hnum, dif, sdif);
 		if (score > badness) {
 			badness = score;
 
+			if (need_rescore)
+				continue;
+
 			if (sk->sk_state == TCP_ESTABLISHED) {
 				result = sk;
 				continue;
@@ -458,9 +464,14 @@ static struct sock *udp4_lib_lookup2(struct net *net,
 			if (IS_ERR(result))
 				continue;
 
-			badness = compute_score(result, net, saddr, sport,
-						daddr, hnum, dif, sdif);
-
+			/* compute_score is too long of a function to be
+			 * inlined, and calling it again here yields
+			 * measureable overhead for some
+			 * workloads. Work around it by jumping
+			 * backwards to rescore 'result'.
+			 */
+			need_rescore = true;
+			goto rescore;
 		}
 	}
 	return result;
diff --git a/net/ipv6/udp.c b/net/ipv6/udp.c
index 124cf2bb2a6d7..c77ee9a3cde24 100644
--- a/net/ipv6/udp.c
+++ b/net/ipv6/udp.c
@@ -171,15 +171,21 @@ static struct sock *udp6_lib_lookup2(struct net *net,
 {
 	struct sock *sk, *result;
 	int score, badness;
+	bool need_rescore;
 
 	result = NULL;
 	badness = -1;
 	udp_portaddr_for_each_entry_rcu(sk, &hslot2->head) {
-		score = compute_score(sk, net, saddr, sport,
-				      daddr, hnum, dif, sdif);
+		need_rescore = false;
+rescore:
+		score = compute_score(need_rescore ? result : sk, net, saddr,
+				      sport, daddr, hnum, dif, sdif);
 		if (score > badness) {
 			badness = score;
 
+			if (need_rescore)
+				continue;
+
 			if (sk->sk_state == TCP_ESTABLISHED) {
 				result = sk;
 				continue;
@@ -200,8 +206,14 @@ static struct sock *udp6_lib_lookup2(struct net *net,
 			if (IS_ERR(result))
 				continue;
 
-			badness = compute_score(sk, net, saddr, sport,
-						daddr, hnum, dif, sdif);
+			/* compute_score is too long of a function to be
+			 * inlined, and calling it again here yields
+			 * measureable overhead for some
+			 * workloads. Work around it by jumping
+			 * backwards to rescore 'result'.
+			 */
+			need_rescore = true;
+			goto rescore;
 		}
 	}
 	return result;