Re: [PATCH] x86/asm/64: Align start of __clear_user() loop to 16-bytes

Alexey Dobriyan <adobriyan@xxxxxxxxx> · Fri, 19 Jun 2020 00:01:51 +0300

On Thu, Jun 18, 2020 at 04:39:35PM +0000, David Laight wrote:
> From: Alexey Dobriyan 
> > Sent: 18 June 2020 14:17
> ...
> > > > diff --git a/arch/x86/lib/usercopy_64.c b/arch/x86/lib/usercopy_64.c
> > > > index fff28c6f73a2..b0dfac3d3df7 100644
> > > > --- a/arch/x86/lib/usercopy_64.c
> > > > +++ b/arch/x86/lib/usercopy_64.c
> > > > @@ -24,6 +24,7 @@ unsigned long __clear_user(void __user *addr, unsigned long size)
> > > >  	asm volatile(
> > > >  		"	testq  %[size8],%[size8]\n"
> > > >  		"	jz     4f\n"
> > > > +		"	.align 16\n"
> > > >  		"0:	movq $0,(%[dst])\n"
> > > >  		"	addq   $8,%[dst]\n"
> > > >  		"	decl %%ecx ; jnz   0b\n"
> > >
> > > You can do better than that loop.
> > > Change 'dst' to point to the end of the buffer, negate the count
> > > and divide by 8 and you get:
> > > 		"0:	movq $0,(%[dst],%%rcx,8)\n"
> > > 		"	add $1,%%rcx\n"
> > > 		"	jnz 0b\n"
> > > which might run at one iteration per clock, especially on CPUs that
> > > fuse the add and jnz into a single uop.
> > > (You need to use add not inc.)
> > 
> > /dev/zero should probably use REP STOSB etc just like everything else.
> 
> Almost certainly it shouldn't, and neither should anything else.
> Potentially it could use whatever memset() is patched to.
> That MIGHT be 'rep stos' on some cpu variants, but in general
> it is slow.

Yes, that's what I meant: alternatives choosing the REP variant.
memset loops are so 21st century.