Patch "crypto: xor - fix template benchmarking" has been added to the 6.6-stable tree

Sasha Levin <sashal@xxxxxxxxxx> · Mon, 30 Sep 2024 19:49:21 -0400

This is a note to let you know that I've just added the patch titled

    crypto: xor - fix template benchmarking

to the 6.6-stable tree which can be found at:
    http://www.kernel.org/git/?p=linux/kernel/git/stable/stable-queue.git;a=summary

The filename of the patch is:
     crypto-xor-fix-template-benchmarking.patch
and it can be found in the queue-6.6 subdirectory.

If you, or anyone else, feel it should not be added to the stable tree,
please let <stable@xxxxxxxxxxxxxxx> know about it.



commit 5b0a9eb5bf8f2dd6d0dc295f173e03497ea2402a
Author: Helge Deller <deller@xxxxxxxxxx>
Date:   Mon Jul 8 14:24:52 2024 +0200

    crypto: xor - fix template benchmarking
    
    [ Upstream commit ab9a244c396aae4aaa34b2399b82fc15ec2df8c1 ]
    
    Commit c055e3eae0f1 ("crypto: xor - use ktime for template benchmarking")
    switched from using jiffies to ktime-based performance benchmarking.
    
    This works nicely on machines which have a fine-grained ktime()
    clocksource, e.g. x86 machines with TSC.
    But other machines, e.g. my 4-way HP PARISC server, don't have such
    fine-grained clocksources, so 800 xor loops can appear to take zero
    time, which then shows up in the logs as:
    
     xor: measuring software checksum speed
        8regs           : -1018167296 MB/sec
        8regs_prefetch  : -1018167296 MB/sec
        32regs          : -1018167296 MB/sec
        32regs_prefetch : -1018167296 MB/sec
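
The negative figure follows directly from the old formula once the measured
time is clamped from 0 ns to 1 ns. Below is a minimal userspace sketch of
that arithmetic, not the kernel code itself. REPS = 800 comes from the
"800 xor loops" above; BENCH_SIZE = 4096 is an assumption, chosen because it
reproduces the logged value exactly. The function name is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define REPS       800U   /* the "800 xor loops" from the commit message */
#define BENCH_SIZE 4096U  /* ASSUMPTION: value that matches the logged number */

/*
 * Old calculation: a measured time of 0 ns is clamped to 1 ns, and the
 * unsigned quotient is then stored in a signed 32-bit "speed".
 */
static int32_t old_speed_mb_per_sec(uint32_t elapsed_ns)
{
	uint32_t raw;

	/* The unsigned product itself still fits in 32 bits. */
	assert(1000ULL * REPS * BENCH_SIZE <= UINT32_MAX);

	if (!elapsed_ns)
		elapsed_ns = 1;
	raw = (1000U * REPS * BENCH_SIZE) / elapsed_ns;

	/*
	 * 3276800000 does not fit in int32_t. The conversion is
	 * implementation-defined; on common two's-complement targets it
	 * wraps to 3276800000 - 2^32.
	 */
	return (int32_t)raw;
}
```

Under these assumptions, an elapsed time of 0 ns gives 1000 * 800 * 4096 =
3276800000, which wraps to -1018167296 when stored in a 32-bit int.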
    
    Fix this with some small modifications to the existing code so that
    the algorithm always produces correct results without introducing
    major delays on architectures with a fine-grained ktime()
    clocksource:
    a) Delay the start of the timing until ktime() has just advanced. On
       machines with a fast ktime() this should cost just one additional
       ktime() call.
    b) Count the number of loops. Run at least 800 loops and finish no
       earlier than when the ktime() counter has advanced.
    
    With that the throughput can now be calculated more accurately under all
    conditions.
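
Steps (a) and (b) can be sketched in userspace as follows. This is an
illustration under stated assumptions, not the kernel implementation: the
clock and the workload are passed in as function pointers, coarse_clock_ns()
is a hypothetical clock that only advances by 4 ms every 1000 reads, and
dummy_xor() stands in for tmpl->do_2(). BENCH_SIZE = 4096 is assumed. Unlike
the kernel code, the sketch does all arithmetic in 64 bits.

```c
#include <assert.h>
#include <stdint.h>

#define REPS       800U   /* minimum number of loops, as in the commit message */
#define BENCH_SIZE 4096U  /* ASSUMPTION: bytes processed per loop */

/* Hypothetical coarse clock: advances by 4 ms once every 1000 reads. */
static uint64_t fake_calls;
static uint64_t coarse_clock_ns(void)
{
	return (fake_calls++ / 1000) * 4000000ULL;
}

/* Hypothetical stand-in for the xor routine being measured. */
static void dummy_xor(void)
{
}

static uint64_t bench_mb_per_sec(uint64_t (*now)(void), void (*fn)(void))
{
	uint64_t reps = 0, t0, start, elapsed;

	/* (a) delay the start until the clock has just advanced */
	t0 = now();
	while ((start = now()) == t0)
		;

	/* (b) run at least REPS loops, then continue until the clock moves */
	do {
		fn();
	} while (reps++ < REPS || (t0 = now()) == start);

	elapsed = t0 - start;
	assert(elapsed > 0);	/* guaranteed by the loop exit condition */

	/* bytes/ns == GB/s, multiply by 1000 to get MB/s */
	return 1000 * reps * BENCH_SIZE / elapsed;
}
```

With the fake clock above, bench_mb_per_sec(coarse_clock_ns, dummy_xor) runs
1800 loops over one 4 ms tick and returns 1843. The loop count adapts to the
clock resolution, so the divisor is never zero.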
    
    Fixes: c055e3eae0f1 ("crypto: xor - use ktime for template benchmarking")
    Signed-off-by: Helge Deller <deller@xxxxxx>
    Tested-by: John David Anglin <dave.anglin@xxxxxxxx>
    
    v2:
    - clean up coding style (noticed & suggested by Herbert Xu)
    - rephrased & fixed typo in commit message
    
    Signed-off-by: Herbert Xu <herbert@xxxxxxxxxxxxxxxxxxx>
    Signed-off-by: Sasha Levin <sashal@xxxxxxxxxx>

diff --git a/crypto/xor.c b/crypto/xor.c
index 8e72e5d5db0de..56aa3169e8717 100644
--- a/crypto/xor.c
+++ b/crypto/xor.c
@@ -83,33 +83,30 @@ static void __init
 do_xor_speed(struct xor_block_template *tmpl, void *b1, void *b2)
 {
 	int speed;
-	int i, j;
-	ktime_t min, start, diff;
+	unsigned long reps;
+	ktime_t min, start, t0;
 
 	tmpl->next = template_list;
 	template_list = tmpl;
 
 	preempt_disable();
 
-	min = (ktime_t)S64_MAX;
-	for (i = 0; i < 3; i++) {
-		start = ktime_get();
-		for (j = 0; j < REPS; j++) {
-			mb(); /* prevent loop optimization */
-			tmpl->do_2(BENCH_SIZE, b1, b2);
-			mb();
-		}
-		diff = ktime_sub(ktime_get(), start);
-		if (diff < min)
-			min = diff;
-	}
+	reps = 0;
+	t0 = ktime_get();
+	/* delay start until time has advanced */
+	while ((start = ktime_get()) == t0)
+		cpu_relax();
+	do {
+		mb(); /* prevent loop optimization */
+		tmpl->do_2(BENCH_SIZE, b1, b2);
+		mb();
+	} while (reps++ < REPS || (t0 = ktime_get()) == start);
+	min = ktime_sub(t0, start);
 
 	preempt_enable();
 
 	// bytes/ns == GB/s, multiply by 1000 to get MB/s [not MiB/s]
-	if (!min)
-		min = 1;
-	speed = (1000 * REPS * BENCH_SIZE) / (unsigned int)ktime_to_ns(min);
+	speed = (1000 * reps * BENCH_SIZE) / (unsigned int)ktime_to_ns(min);
 	tmpl->speed = speed;
 
 	pr_info("   %-16s: %5d MB/sec\n", tmpl->name, speed);