Re: [PATCH v2 4/4] __arch_xprod64(): make __always_inline when optimizing for performance

On Sun, 7 Jul 2024, Arnd Bergmann wrote:

> On Sun, Jul 7, 2024, at 21:14, Nicolas Pitre wrote:
> > On Sun, 7 Jul 2024, Arnd Bergmann wrote:
> >
> >> On Sun, Jul 7, 2024, at 19:17, Nicolas Pitre wrote:
> >> > From: Nicolas Pitre <npitre@xxxxxxxxxxxx>
> >> >
> >> > Recent gcc versions no longer systematically inline __arch_xprod64()
> >> > and that has performance implications. Give the compiler the freedom to
> >> > decide only when optimizing for size.
> >> >
> >> > Signed-off-by: Nicolas Pitre <npitre@xxxxxxxxxxxx>
> >> 
> >> Seems reasonable. Just to make sure: do you know if the non-inline
> >> version of xprod_64 ends up producing a more efficient division
> >> result than the __do_div64() code path on arch/arm?
> >
> > __arch_xprod_64() is part of the __do_div64() code path. So I'm not sure 
> > of your question.
> >
> > Obviously, having __arch_xprod_64() inlined is faster but it increases 
> > binary size.
> 
> I meant whether calling __div64_const32->__arch_xprod_64() is
> still faster for a constant base when the new __arch_xprod_64()
> is out of line, compared to the __div64_32->__do_div64()
> assembly code path we take for a non-constant base.

Oh, most likely yes. The non-constant base has to go through the whole
one-bit-at-a-time division loop, whereas the constant base with
__div64_const32 boils down to four 64-bit multiply-and-add operations.
Moving __arch_xprod_64() out of line adds argument shuffling overhead,
and it can no longer skip overflow handling, but it should still win.
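
To illustrate the difference between the two paths, here is a minimal
user-space sketch in C. It is not the kernel code: the function names
(div64_32_bitwise, mul64_hi, div64_by_reciprocal) are hypothetical, and
the reciprocal path uses a plain floor((2^64 - 1) / base) reciprocal plus
one correction step, whereas __div64_const32 derives its reciprocal at
compile time and avoids the correction. The point is only the shape of
the work: 64 loop iterations versus four 32x32->64 multiplies.

```c
#include <assert.h>
#include <stdint.h>

/* Restoring shift-and-subtract division: one quotient bit per iteration. */
static uint64_t div64_32_bitwise(uint64_t n, uint32_t base, uint32_t *rem)
{
	uint64_t q = 0, r = 0;

	assert(base != 0);
	for (int i = 63; i >= 0; i--) {
		r = (r << 1) | ((n >> i) & 1);	/* bring down the next bit */
		if (r >= base) {
			r -= base;
			q |= 1ULL << i;
		}
	}
	*rem = (uint32_t)r;
	return q;
}

/* High 64 bits of a 64x64 product, built from four 32x32->64 multiplies. */
static uint64_t mul64_hi(uint64_t a, uint64_t b)
{
	uint64_t a_lo = (uint32_t)a, a_hi = a >> 32;
	uint64_t b_lo = (uint32_t)b, b_hi = b >> 32;
	uint64_t lo_lo = a_lo * b_lo;
	uint64_t hi_lo = a_hi * b_lo;
	uint64_t lo_hi = a_lo * b_hi;
	uint64_t hi_hi = a_hi * b_hi;
	/* Sum of the middle terms; this cannot overflow 64 bits. */
	uint64_t cross = (lo_lo >> 32) + (uint32_t)hi_lo + lo_hi;

	return hi_hi + (hi_lo >> 32) + (cross >> 32);
}

/*
 * Division by multiplication. recip must be ~0ULL / base, computed once
 * (in the kernel the equivalent value is a compile-time constant). The
 * estimate is either the true quotient or one below it, hence the single
 * correction step.
 */
static uint64_t div64_by_reciprocal(uint64_t n, uint32_t base,
				    uint64_t recip, uint32_t *rem)
{
	uint64_t q = mul64_hi(n, recip);
	uint64_t r = n - q * base;

	assert(base != 0);
	if (r >= base) {
		q++;
		r -= base;
	}
	*rem = (uint32_t)r;
	return q;
}
```

Both functions return the same quotient and remainder for a given n and
base; only the amount of work differs.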

Here are some numbers. With latest patches using __always_inline:

test_div64: Starting 64bit/32bit division and modulo test
test_div64: Completed 64bit/32bit division and modulo test, 0.048285584s elapsed

Latest patches but __always_inline left out:

test_div64: Starting 64bit/32bit division and modulo test
test_div64: Completed 64bit/32bit division and modulo test, 0.053023584s elapsed

Forcing both constant and non-constant base through the same path:

test_div64: Starting 64bit/32bit division and modulo test
test_div64: Completed 64bit/32bit division and modulo test, 0.103263776s elapsed

It is worth noting that test_div64 already does half the test with
non-constant divisors, so the impact is greater than what those numbers show.
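
For reference, the whole-test ratios implied by the three timings above
can be worked out with a small helper. This is only a sketch: the name
slowdown_percent is hypothetical, and the result covers the full test run
(constant and non-constant divisors mixed), not the constant-divisor path
alone.

```c
#include <stdint.h>

/* Hypothetical helper: how much slower t is than baseline, in percent. */
static double slowdown_percent(double baseline, double t)
{
	return (t / baseline - 1.0) * 100.0;
}
```

With the __always_inline run as the baseline, slowdown_percent(0.048285584,
0.053023584) comes out at roughly 10%, and slowdown_percent(0.048285584,
0.103263776) at roughly 114%.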

And for what it is worth, those numbers were obtained using QEMU. The 
gcc version is 14.1.0.


Nicolas



