Patch "x86/asm/64: Align start of __clear_user() loop to 16-bytes" has been added to the 4.19-stable tree

<gregkh@xxxxxxxxxxxxxxxxxxx> · Mon, 29 Jun 2020 12:49:10 +0200

This is a note to let you know that I've just added the patch titled

    x86/asm/64: Align start of __clear_user() loop to 16-bytes

to the 4.19-stable tree which can be found at:
    http://www.kernel.org/git/?p=linux/kernel/git/stable/stable-queue.git;a=summary

The filename of the patch is:
     x86-asm-64-align-start-of-__clear_user-loop-to-16-bytes.patch
and it can be found in the queue-4.19 subdirectory.

If you, or anyone else, feels it should not be added to the stable tree,
please let <stable@xxxxxxxxxxxxxxx> know about it.


>From bb5570ad3b54e7930997aec76ab68256d5236d94 Mon Sep 17 00:00:00 2001
From: Matt Fleming <matt@xxxxxxxxxxxxxxxxxxx>
Date: Thu, 18 Jun 2020 11:20:02 +0100
Subject: x86/asm/64: Align start of __clear_user() loop to 16-bytes

From: Matt Fleming <matt@xxxxxxxxxxxxxxxxxxx>

commit bb5570ad3b54e7930997aec76ab68256d5236d94 upstream.

x86 CPUs can suffer severe performance drops if a tight loop, such as
the ones in __clear_user(), straddles a 16-byte instruction fetch
window, or worse, a 64-byte cacheline. This issues was discovered in the
SUSE kernel with the following commit,

  1153933703d9 ("x86/asm/64: Micro-optimize __clear_user() - Use immediate constants")

which increased the code object size from 10 bytes to 15 bytes and
caused the 8-byte copy loop in __clear_user() to be split across a
64-byte cacheline.

Aligning the start of the loop to 16-bytes makes this fit neatly inside
a single instruction fetch window again and restores the performance of
__clear_user() which is used heavily when reading from /dev/zero.

Here are some numbers from running libmicro's read_z* and pread_z*
microbenchmarks which read from /dev/zero:

  Zen 1 (Naples)

  libmicro-file
                                        5.7.0-rc6              5.7.0-rc6              5.7.0-rc6
                                                    revert-1153933703d9+               align16+
  Time mean95-pread_z100k       9.9195 (   0.00%)      5.9856 (  39.66%)      5.9938 (  39.58%)
  Time mean95-pread_z10k        1.1378 (   0.00%)      0.7450 (  34.52%)      0.7467 (  34.38%)
  Time mean95-pread_z1k         0.2623 (   0.00%)      0.2251 (  14.18%)      0.2252 (  14.15%)
  Time mean95-pread_zw100k      9.9974 (   0.00%)      6.0648 (  39.34%)      6.0756 (  39.23%)
  Time mean95-read_z100k        9.8940 (   0.00%)      5.9885 (  39.47%)      5.9994 (  39.36%)
  Time mean95-read_z10k         1.1394 (   0.00%)      0.7483 (  34.33%)      0.7482 (  34.33%)

Note that this doesn't affect Haswell or Broadwell microarchitectures
which seem to avoid the alignment issue by executing the loop straight
out of the Loop Stream Detector (verified using perf events).

Fixes: 1153933703d9 ("x86/asm/64: Micro-optimize __clear_user() - Use immediate constants")
Signed-off-by: Matt Fleming <matt@xxxxxxxxxxxxxxxxxxx>
Signed-off-by: Borislav Petkov <bp@xxxxxxx>
Cc: <stable@xxxxxxxxxxxxxxx> # v4.19+
Link: https://lkml.kernel.org/r/20200618102002.30034-1-matt@xxxxxxxxxxxxxxxxxxx
Signed-off-by: Greg Kroah-Hartman <gregkh@xxxxxxxxxxxxxxxxxxx>

---
 arch/x86/lib/usercopy_64.c |    1 +
 1 file changed, 1 insertion(+)

--- a/arch/x86/lib/usercopy_64.c
+++ b/arch/x86/lib/usercopy_64.c
@@ -23,6 +23,7 @@ unsigned long __clear_user(void __user *
 	asm volatile(
 		"	testq  %[size8],%[size8]\n"
 		"	jz     4f\n"
+		"	.align 16\n"
 		"0:	movq $0,(%[dst])\n"
 		"	addq   $8,%[dst]\n"
 		"	decl %%ecx ; jnz   0b\n"


Patches currently in stable-queue which might be from matt@xxxxxxxxxxxxxxxxxxx are

queue-4.19/x86-asm-64-align-start-of-__clear_user-loop-to-16-bytes.patch