On Thu, Jun 18, 2020 at 04:39:35PM +0000, David Laight wrote:
> From: Alexey Dobriyan
> > Sent: 18 June 2020 14:17
> ...
> > > > diff --git a/arch/x86/lib/usercopy_64.c b/arch/x86/lib/usercopy_64.c
> > > > index fff28c6f73a2..b0dfac3d3df7 100644
> > > > --- a/arch/x86/lib/usercopy_64.c
> > > > +++ b/arch/x86/lib/usercopy_64.c
> > > > @@ -24,6 +24,7 @@ unsigned long __clear_user(void __user *addr, unsigned long size)
> > > >  	asm volatile(
> > > >  		"	testq %[size8],%[size8]\n"
> > > >  		"	jz 4f\n"
> > > > +		"	.align 16\n"
> > > >  		"0:	movq $0,(%[dst])\n"
> > > >  		"	addq $8,%[dst]\n"
> > > >  		"	decl %%ecx ; jnz 0b\n"
> > >
> > > You can do better than that loop.
> > > Change 'dst' to point to the end of the buffer, negate the count
> > > and divide by 8 and you get:
> > > 	"0:	movq $0,(%[dst],%%ecx,8)\n"
> > > 	"	add $1,%%ecx\n"
> > > 	"	jnz 0b\n"
> > > which might run at one iteration per clock, especially on CPUs that pair
> > > the add and jnz into a single uop.
> > > (You need to use add, not inc.)
> >
> > /dev/zero should probably use REP STOSB etc just like everything else.
>
> Almost certainly it shouldn't, and neither should anything else.
> Potentially it could use whatever memset() is patched to.
> That MIGHT be 'rep stos' on some CPU variants, but in general
> it is slow.

Yes, that's what I meant: alternatives choosing the REP variant.

memset loops are so 21st century.
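
To make the two variants concrete, here is a minimal userspace sketch of
the backwards-indexed loop described above: point the pointer one past the
end of the buffer, keep the negated qword count as the index, and let the
loop body be just a store plus an add/jnz pair. The function name and
signature are made up for illustration; this is not the kernel's
__clear_user(), only the loop shape, assuming x86-64 gcc/clang inline asm:

	#include <stddef.h>
	#include <stdint.h>

	static void zero_qwords_from_end(uint64_t *dst, size_t qwords)
	{
		if (!qwords)
			return;

		uint64_t *end = dst + qwords;	/* one past the last qword */
		long idx = -(long)qwords;	/* negative index, counts up to zero */

		asm volatile(
			"0:	movq $0,(%[end],%[idx],8)\n"
			"	addq $1,%[idx]\n"	/* add rather than inc, per the note above */
			"	jnz 0b\n"
			: [idx] "+r" (idx)
			: [end] "r" (end)
			: "memory");
	}

And a "REP variant", again only a hedged userspace sketch rather than
anything resembling the kernel code, is just handing the store loop to the
fast-string microcode:

	static void zero_bytes_rep_stosb(void *dst, size_t len)
	{
		/* rep stosb stores AL to [rdi], rcx times; both registers are updated */
		asm volatile("rep stosb"
			     : "+D" (dst), "+c" (len)
			     : "a" (0)
			     : "memory");
	}

In the kernel, picking between an open-coded loop and the rep form would go
through the alternatives machinery keyed on CPU features, which is what
"alternatives choosing the REP variant" means above.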