Re: [PATCH 2/2] asm-generic/div64: reimplement __arch_xprod64()

On Fri, 5 Jul 2024, Arnd Bergmann wrote:

> On Fri, Jul 5, 2024, at 04:20, Nicolas Pitre wrote:
> > From: Nicolas Pitre <npitre@xxxxxxxxxxxx>
> >
> > Several years later I just realized that this code could be optimized
> > and more importantly simplified even further. With some reordering, it
> > is possible to dispense with overflow handling entirely and still have
> > optimal code.
> >
> > There is also no longer a reason to have the possibility for
> > architectures to override the generic version. Only ARM did it and these
> > days the compiler does a better job than the hand-crafted assembly
> > version anyway.
> >
> > Kernel binary gets slightly smaller as well. Using the ARM's
> > versatile_defconfig plus CONFIG_TEST_DIV64=y:
> >
> > Before this patch:
> >
> >    text    data     bss     dec     hex filename
> > 9644668 2743926  193424 12582018         bffc82 vmlinux
> >
> > With this patch:
> >
> >    text    data     bss     dec     hex filename
> > 9643572 2743926  193424 12580922         bff83a vmlinux
> >
> > Signed-off-by: Nicolas Pitre <npitre@xxxxxxxxxxxx>
> 
> This looks really nice, thanks for the work!
> 
> I've tried reproducing your finding to see what compiler
> version started being good enough to benefit from the
> new version. Looking at just the vmlinux size as you did
> above, I can confirm that the generated code is noticeably
> smaller in gcc-11 and above, slightly smaller in gcc-10
> but larger in gcc-9 and below.

Well well... It turns out that binary size is a bad metric here. The main
reason the compiled code gets smaller is that gcc decides _not_ to
inline __arch_xprod_64(). That makes the kernel smaller, but a bunch of
conditionals in there were really meant to be resolved at compile time
in order to generate the best code for each call site. With a
non-inlined version, those conditionals are no longer based on
constants, and the compiler emits code to determine at runtime whether
2 or 3 instructions can be saved, which completely defeats the purpose
in addition to making performance worse.
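
To make the effect concrete, here is a minimal sketch. It is not the
actual __arch_xprod_64() code: the names mul_hi_inlined(),
mul_hi_outofline() and check_sketch() are made up for illustration, and
the always_inline/noinline attributes assume gcc or clang. When the
multiplier is a constant and the helper is inlined, the test on m_hi
disappears at build time. In the out-of-line copy the same test becomes
a branch taken on every call.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper for illustration only; NOT the kernel's
 * __arch_xprod_64(). Returns the high 64 bits of the 128-bit
 * product m * n, built from 32x32->64 multiplies.
 */
static inline __attribute__((always_inline))
uint64_t mul_hi_inlined(uint64_t m, uint64_t n)
{
	uint32_t m_lo = (uint32_t)m, m_hi = (uint32_t)(m >> 32);
	uint32_t n_lo = (uint32_t)n, n_hi = (uint32_t)(n >> 32);
	uint64_t ll = (uint64_t)m_lo * n_lo;
	uint64_t lh = (uint64_t)m_lo * n_hi;

	/*
	 * With a constant m and an inlined call, the compiler can resolve
	 * this test at build time and keep only one of the two paths.
	 */
	if (m_hi == 0)
		return (lh + (ll >> 32)) >> 32;

	uint64_t hl = (uint64_t)m_hi * n_lo;
	uint64_t hh = (uint64_t)m_hi * n_hi;
	/* Sum of three values below 2^32, so this cannot overflow. */
	uint64_t mid = (ll >> 32) + (uint32_t)lh + (uint32_t)hl;

	return hh + (lh >> 32) + (hl >> 32) + (mid >> 32);
}

/*
 * Same logic, forced out of line: m is now a runtime value, so the
 * test on m_hi is a real branch executed on every call.
 */
static __attribute__((noinline))
uint64_t mul_hi_outofline(uint64_t m, uint64_t n)
{
	return mul_hi_inlined(m, n);
}

/* Both variants return the same value; only the generated code differs. */
void check_sketch(void)
{
	/* 3 * 2^63 = 1.5 * 2^64, so the high 64 bits are 1. */
	assert(mul_hi_inlined(3, 1ULL << 63) == 1);
	assert(mul_hi_outofline(3, 1ULL << 63) == 1);
}
```

Comparing the disassembly of the two callers in check_sketch() should
show the difference: the size of the out-of-line variant says nothing
about the cost paid at each call.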

So I've reworked it all again, this time taking into account the
possibility that the compiler sometimes does not inline that code. Plus
some more simplifications.


Nicolas



