On Sun, 7 Jul 2024, Arnd Bergmann wrote:

> On Sun, Jul 7, 2024, at 21:14, Nicolas Pitre wrote:
> > On Sun, 7 Jul 2024, Arnd Bergmann wrote:
> >
> >> On Sun, Jul 7, 2024, at 19:17, Nicolas Pitre wrote:
> >> > From: Nicolas Pitre <npitre@xxxxxxxxxxxx>
> >> >
> >> > Recent gcc versions started not systematically inline __arch_xprod64()
> >> > and that has performance implications. Give the compiler the freedom to
> >> > decide only when optimizing for size.
> >> >
> >> > Signed-off-by: Nicolas Pitre <npitre@xxxxxxxxxxxx>
> >>
> >> Seems reasonable. Just to make sure: do you know if the non-inline
> >> version of xprod_64 ends up producing a more efficient division
> >> result than the __do_div64() code path on arch/arm?
> >
> > __arch_xprod_64() is part of the __do_div64() code path. So I'm not sure
> > of your question.
> >
> > Obviously, having __arch_xprod_64() inlined is faster but it increases
> > binary size.
>
> I meant whether calling __div64_const32->__arch_xprod_64() is
> still faster for a constant base when the new __arch_xprod_64()
> is out of line, compared to the __div64_32->__do_div64()
> assembly code path we take for a non-constant base.

Oh, most likely yes. The non-constant base has to go through the whole
one-bit-at-a-time division loop, whereas the constant base with
__div64_const32 results in four 64-bit multiply-and-add operations.
Moving __arch_xprod_64() out of line adds the argument shuffling
overhead and it can't skip overflow handling, but it is still faster.

Here are some numbers.

With latest patches using __always_inline:

test_div64: Starting 64bit/32bit division and modulo test
test_div64: Completed 64bit/32bit division and modulo test, 0.048285584s elapsed

Latest patches but __always_inline left out:

test_div64: Starting 64bit/32bit division and modulo test
test_div64: Completed 64bit/32bit division and modulo test, 0.053023584s elapsed

Forcing both constant and non-constant base through the same path:

test_div64: Starting 64bit/32bit division and modulo test
test_div64: Completed 64bit/32bit division and modulo test, 0.103263776s elapsed

It is worth noting that test_div64 does half the test with non-constant
divisors already so the impact is greater than what those numbers show.

And for what it is worth, those numbers were obtained using QEMU. The
gcc version is 14.1.0.


Nicolas
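
For readers following the thread, the difference between the two paths
discussed above can be sketched in portable C. This is only an
illustration, not the kernel code: the reciprocal scheme below is the
textbook round-up multiplier method (Granlund and Montgomery), swapped in
for the different reciprocal and bias computation that __div64_const32
performs, and every name in it (div_bitwise, mul64_hi, magic_for,
div_by_const, check) is hypothetical. It assumes a non-zero divisor.

```c
#include <assert.h>
#include <stdint.h>

/* Path 1: restoring shift-and-subtract, one quotient bit per iteration. */
static uint64_t div_bitwise(uint64_t n, uint32_t d)
{
	uint64_t q = 0, r = 0;

	for (int i = 63; i >= 0; i--) {
		r = (r << 1) | ((n >> i) & 1);	/* bring down the next bit */
		if (r >= d) {
			r -= d;
			q |= (uint64_t)1 << i;
		}
	}
	return q;
}

/* High 64 bits of a 64x64 product, from four 32x32->64 multiplies. */
static uint64_t mul64_hi(uint64_t a, uint64_t b)
{
	uint64_t a_lo = (uint32_t)a, a_hi = a >> 32;
	uint64_t b_lo = (uint32_t)b, b_hi = b >> 32;
	uint64_t ll = a_lo * b_lo, lh = a_lo * b_hi;
	uint64_t hl = a_hi * b_lo, hh = a_hi * b_hi;
	/* Three terms below 2^32 each, so this sum cannot overflow. */
	uint64_t mid = (ll >> 32) + (uint32_t)lh + (uint32_t)hl;

	return hh + (lh >> 32) + (hl >> 32) + (mid >> 32);
}

struct magic {
	uint64_t m;		/* reciprocal multiplier */
	unsigned sh1, sh2;	/* post-multiply shifts */
};

/* A compiler could fold this entirely when d is a constant. */
static struct magic magic_for(uint32_t d)
{
	unsigned l = 0;

	while (((uint64_t)1 << l) < d)	/* l = ceil(log2(d)) */
		l++;

	uint64_t k = ((uint64_t)1 << l) - d;	/* always k < d */
	/* floor(2^64 * k / d) via two 64-bit long-division steps */
	uint64_t hi = (k << 32) / d;
	uint64_t lo = (((k << 32) % d) << 32) / d;

	struct magic mg = { ((hi << 32) | lo) + 1, l ? 1 : 0, l ? l - 1 : 0 };
	return mg;
}

/* Path 2: one high multiply plus a few adds and shifts, no loop. */
static uint64_t div_by_const(uint64_t n, const struct magic *mg)
{
	uint64_t t = mul64_hi(mg->m, n);

	return (t + ((n - t) >> mg->sh1)) >> mg->sh2;
}

/* Compare both sketches against the compiler's native division. */
static void check(uint64_t n, uint32_t d)
{
	struct magic mg = magic_for(d);

	assert(div_bitwise(n, d) == n / d);
	assert(div_by_const(n, &mg) == n / d);
}
```

Calling check(UINT64_MAX, 1000) exercises both paths on the same input.
The first path always runs 64 loop iterations; the second does a fixed
handful of multiplies and adds, which is why it can stay ahead even
when the multiply helper is an out-of-line call.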