I spotted rep_movs_alternative on a flamegraph while building the kernel
on an AMD CPU, and I was surprised to find the CPU does not have ERMS.

The 64-byte loop there can be trivially replaced with movsq and I don't
think there are any reasons to NOT do it. If anything I'm surprised the
non-ERMS case was left hanging around after the 32KB regression reported
by Eric Dumazet. Anyhow, the patch below remedies it.

The patch initially had:

 SYM_FUNC_START(rep_movs_alternative)
 	cmpq $64,%rcx
-	jae .Llarge
+	ALTERNATIVE "jae .Lrep_movsq", "jae .Lrep_movsb", X86_FEATURE_ERMS

But then I found the weird nops, and after reading the thread which
resulted in bringing back ERMS I see why:
https://lore.kernel.org/lkml/CANn89iKUbyrJ=r2+_kK+sb2ZSSHifFZ7QkPLDpAtkJ8v4WUumA@xxxxxxxxxxxxxx/

That said, whacking the 64-byte loop in favor of rep movsq is definitely
the right thing to do, and the patch below is one way to do it.

What I don't get is what's up with ERMS availability on AMD CPUs. I was
told newer uarchs support it, but the bit may be disabled in the BIOS(?).
Anyhow, I temporarily got an instance on Amazon EC2 with an EPYC 7R13 and
the bit is not there; I can't tell whether that is a configuration
problem or not.

I also note there is quite a mess concerning other string ops, which I'm
going to tackle in another thread later(tm).

================ cut here ================

Intel CPUs have shipped with ERMS for over a decade, but this is not true
for AMD. The hand-rolled mov loop executed in this case is quite pessimal
compared to rep movsq for bigger sizes. While the exact cutoff depends on
uarch, everyone is well south of 1KB AFAICS, and sizes bigger than that
are common. The problem can be easily remedied, so do it.

Sample result from read1_processes from will-it-scale on EPYC 7R13
(4KB reads/s):
before:	1507021
after:	1721828 (+14%)

Note that the cutoff point for rep usage is set to 64 bytes, which is way
too conservative, but I'm sticking to what was done in 47ee3f1dd93b
("x86: re-introduce support for ERMS copies for user space accesses").
That is to say *some* copies will now go slower, which is fixable but
beyond the scope of this patch.

While here, make sure labels are unique.
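For reference, here is a minimal C sketch of what the new .Llarge_movsq
path boils down to (illustration only, not part of the patch; the helper
name is made up and the _ASM_EXTABLE_UA-based fault recovery is omitted):

/*
 * Sketch: split the byte count into full quadwords handled by rep movsq
 * and a sub-8-byte remainder handed back to the byte-copy tail.
 */
static inline unsigned long large_movsq_sketch(void *dst, const void *src,
					       unsigned long len)
{
	unsigned long qwords = len >> 3;	/* %rcx: full 8-byte words   */
	unsigned long tail   = len & 7;		/* %eax: leftover bytes (<8) */

	asm volatile("rep movsq"
		     : "+D" (dst), "+S" (src), "+c" (qwords)
		     : : "memory");

	return tail;	/* non-zero -> continue at .Lcopy_user_tail */
}

The real asm additionally stashes the original byte count in %r8 so that a
faulting rep movsq can restore %rcx before jumping to .Lcopy_user_tail.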
Signed-off-by: Mateusz Guzik <mjguzik@xxxxxxxxx>
---
 arch/x86/lib/copy_user_64.S | 61 +++++++++++--------------------
 1 file changed, 17 insertions(+), 44 deletions(-)

diff --git a/arch/x86/lib/copy_user_64.S b/arch/x86/lib/copy_user_64.S
index 01c5de4c279b..2fe61de81a22 100644
--- a/arch/x86/lib/copy_user_64.S
+++ b/arch/x86/lib/copy_user_64.S
@@ -68,55 +68,28 @@ SYM_FUNC_START(rep_movs_alternative)
 	_ASM_EXTABLE_UA( 3b, .Lcopy_user_tail)
 
 .Llarge:
-0:	ALTERNATIVE "jmp .Lunrolled", "rep movsb", X86_FEATURE_ERMS
-1:	RET
+4:	ALTERNATIVE "jmp .Llarge_movsq", "rep movsb", X86_FEATURE_ERMS
+5:	RET
 
-	_ASM_EXTABLE_UA( 0b, 1b)
+	_ASM_EXTABLE_UA( 4b, 5b)
 
-	.p2align 4
-.Lunrolled:
-10:	movq (%rsi),%r8
-11:	movq 8(%rsi),%r9
-12:	movq 16(%rsi),%r10
-13:	movq 24(%rsi),%r11
-14:	movq %r8,(%rdi)
-15:	movq %r9,8(%rdi)
-16:	movq %r10,16(%rdi)
-17:	movq %r11,24(%rdi)
-20:	movq 32(%rsi),%r8
-21:	movq 40(%rsi),%r9
-22:	movq 48(%rsi),%r10
-23:	movq 56(%rsi),%r11
-24:	movq %r8,32(%rdi)
-25:	movq %r9,40(%rdi)
-26:	movq %r10,48(%rdi)
-27:	movq %r11,56(%rdi)
-	addq $64,%rsi
-	addq $64,%rdi
-	subq $64,%rcx
-	cmpq $64,%rcx
-	jae .Lunrolled
-	cmpl $8,%ecx
-	jae .Lword
+.Llarge_movsq:
+	movq %rcx,%r8
+	movq %rcx,%rax
+	shrq $3,%rcx
+	andl $7,%eax
+6:	rep movsq
+	movl %eax,%ecx
 	testl %ecx,%ecx
 	jne .Lcopy_user_tail
 	RET
 
-	_ASM_EXTABLE_UA(10b, .Lcopy_user_tail)
-	_ASM_EXTABLE_UA(11b, .Lcopy_user_tail)
-	_ASM_EXTABLE_UA(12b, .Lcopy_user_tail)
-	_ASM_EXTABLE_UA(13b, .Lcopy_user_tail)
-	_ASM_EXTABLE_UA(14b, .Lcopy_user_tail)
-	_ASM_EXTABLE_UA(15b, .Lcopy_user_tail)
-	_ASM_EXTABLE_UA(16b, .Lcopy_user_tail)
-	_ASM_EXTABLE_UA(17b, .Lcopy_user_tail)
-	_ASM_EXTABLE_UA(20b, .Lcopy_user_tail)
-	_ASM_EXTABLE_UA(21b, .Lcopy_user_tail)
-	_ASM_EXTABLE_UA(22b, .Lcopy_user_tail)
-	_ASM_EXTABLE_UA(23b, .Lcopy_user_tail)
-	_ASM_EXTABLE_UA(24b, .Lcopy_user_tail)
-	_ASM_EXTABLE_UA(25b, .Lcopy_user_tail)
-	_ASM_EXTABLE_UA(26b, .Lcopy_user_tail)
-	_ASM_EXTABLE_UA(27b, .Lcopy_user_tail)
+/*
+ * Recovery after failed rep movsq
+ */
+7:	movq %r8,%rcx
+	jmp .Lcopy_user_tail
+
+	_ASM_EXTABLE_UA( 6b, 7b)
 SYM_FUNC_END(rep_movs_alternative)
 EXPORT_SYMBOL(rep_movs_alternative)
-- 
2.39.2