Re: [PATCH] powerpc/lib: Remove .balign inside string functions for PPC32

Nicholas Piggin <npiggin@xxxxxxxxx> · Fri, 18 May 2018 01:03:47 +1000

On Thu, 17 May 2018 16:21:17 +0200
Christophe LEROY <christophe.leroy@xxxxxx> wrote:

> On 17/05/2018 at 15:46, Michael Ellerman wrote:
> > Nicholas Piggin <npiggin@xxxxxxxxx> writes:
> >   
> >> On Thu, 17 May 2018 12:04:13 +0200 (CEST)
> >> Christophe Leroy <christophe.leroy@xxxxxx> wrote:
> >>  
> >>> commit 87a156fb18fe1 ("Align hot loops of some string functions")
> >>> degraded the performance of string functions by adding useless
> >>> nops.
> >>>
> >>> A simple benchmark on an 8xx calling 100000x a memchr() that
> >>> matches the first byte runs in 41668 TB ticks before this patch
> >>> and in 35986 TB ticks after this patch. So this gives an
> >>> improvement of approx 14%.
> >>>
> >>> Another benchmark doing the same with a memchr() matching the 128th
> >>> byte runs in 1011365 TB ticks before this patch and 1005682 TB ticks
> >>> after this patch, so regardless of the number of loop iterations,
> >>> removing those useless nops improves the test by 5683 TB ticks.
> >>>
> >>> Fixes: 87a156fb18fe1 ("Align hot loops of some string functions")
> >>> Signed-off-by: Christophe Leroy <christophe.leroy@xxxxxx>
> >>> ---
> >>>   Was already sent as part of a series optimising string functions.
> >>>   Resending on its own as it is independent of the other changes in
> >>>   the series.
> >>>
> >>>   arch/powerpc/lib/string.S | 6 ++++++
> >>>   1 file changed, 6 insertions(+)
> >>>
> >>> diff --git a/arch/powerpc/lib/string.S b/arch/powerpc/lib/string.S
> >>> index a787776822d8..a026d8fa8a99 100644
> >>> --- a/arch/powerpc/lib/string.S
> >>> +++ b/arch/powerpc/lib/string.S
> >>> @@ -23,7 +23,9 @@ _GLOBAL(strncpy)
> >>>   	mtctr	r5
> >>>   	addi	r6,r3,-1
> >>>   	addi	r4,r4,-1
> >>> +#ifdef CONFIG_PPC64
> >>>   	.balign 16
> >>> +#endif
> >>>   1:	lbzu	r0,1(r4)
> >>>   	cmpwi	0,r0,0
> >>>   	stbu	r0,1(r6)  
> >>
> >> The ifdefs are a bit ugly, but you can't argue with the numbers. These
> >> alignments should be IFETCH_ALIGN_BYTES, which is intended to optimise
> >> the ifetch performance when you have such a loop (although there is
> >> always a tradeoff for a single iteration).
> >>
> >> Would it make sense to define that for 32-bit as well, and you could use
> >> it here instead of the ifdefs? Small CPUs could just use 0.  
> > 
> > Can we do it with a macro in the header, eg. like:
> > 
> > #ifdef CONFIG_PPC64
> > #define IFETCH_BALIGN	.balign IFETCH_ALIGN_BYTES
> > #else
> > #define IFETCH_BALIGN
> > #endif
> > 
> > ...
> > 
> >    	addi	r4,r4,-1
> >    	IFETCH_BALIGN
> >    1:	lbzu	r0,1(r4)
> > 
> >   
> 
> Why not just define IFETCH_ALIGN_SHIFT for PPC32 as well in asm/cache.h,
> then replace the .balign 16 with .balign IFETCH_ALIGN_BYTES (or .align
> IFETCH_ALIGN_SHIFT)?

Yeah that's what I was thinking. I would do that.
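
Roughly something like the following, as an untested sketch rather than
a real patch. IFETCH_ALIGN_SHIFT and IFETCH_ALIGN_BYTES are the names
already used in this thread; the values are assumptions: 4 for 64-bit
simply matches the existing .balign 16, and 0 for 32-bit is the "small
CPUs could just use 0" idea, which makes the alignment 1 byte and so
emits no padding. Individual 32-bit cores may well want something else.

```c
/* Sketch of what asm/cache.h could provide (values are assumptions). */
#if defined(CONFIG_PPC64)
#define IFETCH_ALIGN_SHIFT	4	/* 16 bytes, same as today's .balign 16 */
#else
#define IFETCH_ALIGN_SHIFT	0	/* 1 byte: no nops inserted on small CPUs */
#endif

#define IFETCH_ALIGN_BYTES	(1 << IFETCH_ALIGN_SHIFT)

/*
 * string.S would then need no ifdef around the loop label:
 *
 *	addi	r4,r4,-1
 *	.balign	IFETCH_ALIGN_BYTES
 * 1:	lbzu	r0,1(r4)
 */
```

That keeps the per-CPU choice in one header and leaves string.S free of
ifdefs.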

Thanks,
Nick