On Tue, Jul 26, 2022 at 11:36:21AM -0700, Linus Torvalds wrote:
On Tue, Jul 26, 2022 at 11:18 AM Yury Norov <yury.norov@xxxxxxxxx> wrote:
We have find_bit_benchmark to check how it works in practice. Would
be great if someone with access to the hardware can share numbers.
Honestly, I doubt benchmarking find_bit in a loop is all that sensible.
Yes, that's what I was thinking - I've never seen it crop up in any of
the perf traces I've seen.
Nevertheless, here's some numbers from a single run of the
find_bit_benchmark module, kernel built with:
arm-linux-gnueabihf-gcc (Debian 10.2.1-6) 10.2.1 20210110
Current native implementation:
[ 46.184565]
Start testing find_bit() with random-filled bitmap
[ 46.195127] find_next_bit: 2440833 ns, 163112 iterations
[ 46.204226] find_next_zero_bit: 2372128 ns, 164569 iterations
[ 46.213152] find_last_bit: 2199779 ns, 163112 iterations
[ 46.299398] find_first_bit: 79526013 ns, 16234 iterations
[ 46.684026] find_first_and_bit: 377912990 ns, 32617 iterations
[ 46.692020] find_next_and_bit: 1269071 ns, 73562 iterations
[ 46.698745]
Start testing find_bit() with sparse bitmap
[ 46.705711] find_next_bit: 118652 ns, 656 iterations
[ 46.716621] find_next_zero_bit: 4183472 ns, 327025 iterations
[ 46.723395] find_last_bit: 50448 ns, 656 iterations
[ 46.762308] find_first_bit: 32190802 ns, 656 iterations
[ 46.769093] find_first_and_bit: 52129 ns, 1 iterations
[ 46.775882] find_next_and_bit: 62522 ns, 1 iterations
Generic implementation:
[ 25.149238]
Start testing find_bit() with random-filled bitmap
[ 25.160002] find_next_bit: 2640943 ns, 163537 iterations
[ 25.169567] find_next_zero_bit: 2838485 ns, 164144 iterations
[ 25.178595] find_last_bit: 2302372 ns, 163538 iterations
[ 25.204016] find_first_bit: 18697630 ns, 16373 iterations
[ 25.602571] find_first_and_bit: 391841480 ns, 32555 iterations
[ 25.610563] find_next_and_bit: 1260306 ns, 73587 iterations
[ 25.617295]
Start testing find_bit() with sparse bitmap
[ 25.624222] find_next_bit: 70289 ns, 656 iterations
[ 25.636478] find_next_zero_bit: 5527050 ns, 327025 iterations
[ 25.643253] find_last_bit: 52147 ns, 656 iterations
[ 25.657304] find_first_bit: 7328573 ns, 656 iterations
[ 25.664087] find_first_and_bit: 48518 ns, 1 iterations
[ 25.670871] find_next_and_bit: 59750 ns, 1 iterations
Overall, I would say it's pretty similar (some generic perform
marginally better, some native perform marginally better) with the
exception of find_first_bit() being much better with the generic
implementation, but find_next_zero_bit() being noticably worse.
So, pretty much nothing of any relevance between them, which may
come as a surprise given the byte vs word access differences between
the two implementations.
I suspect the reason behind that may be because the native
implementation code is smaller than the generic implementation,
outweighing the effects of the by-byte rather than by-word. I would
also suspect that, because of the smaller implementation, the native
version performs better in a I$-cool situation than the generic. Lastly,
I would suspect if we fixed the bug in the native version, and converted
it to use word loads, it would probably be better than the generic
version. I haven't anything to base that on other than gut feeling at
the moment, but I can make the changes to the native implementation and
see what effect that has, possibly tomorrow.
--
RMK's Patch system: https://www.armlinux.org.uk/developer/patches/
FTTP is here! 40Mbps down 10Mbps up. Decent connectivity at last!