On Wed, Jul 27, 2022 at 08:43:22AM +0100, Russell King (Oracle) wrote:
On Tue, Jul 26, 2022 at 06:33:55PM -0700, Yury Norov wrote:
On Tue, Jul 26, 2022 at 5:15 PM Russell King (Oracle)
<linux@xxxxxxxxxxxxxxx> wrote:
On Tue, Jul 26, 2022 at 01:20:23PM -0700, Linus Torvalds wrote:
On Tue, Jul 26, 2022 at 12:44 PM Russell King (Oracle)
<linux@xxxxxxxxxxxxxxx> wrote:
Overall, I would say it's pretty similar (some generic perform
marginally better, some native perform marginally better) with the
exception of find_first_bit() being much better with the generic
implementation, but find_next_zero_bit() being noticably worse.
The generic _find_first_bit() code is actually sane and simple. It
loops over words until it finds a non-zero one, and then does trivial
calculations on that last word.
That explains why the generic code does so much better than your byte-wise asm.
In contrast, the generic _find_next_bit() I find almost offensively
silly - which in turn explains why your byte-wide asm does better.
I think the generic _find_next_bit() should actually do what the m68k
find_next_bit code does: handle the first special word itself, and
then just call find_first_bit() on the rest of it.
And it should *not* try to handle the dynamic "bswap and/or bit sense
invert" thing at all. That should be just four different (trivial)
cases for the first word.
Here's the results for the native version converted to use word loads:
[ 37.319937]
Start testing find_bit() with random-filled bitmap
[ 37.330289] find_next_bit: 2222703 ns, 163781 iterations
[ 37.339186] find_next_zero_bit: 2154375 ns, 163900 iterations
[ 37.348118] find_last_bit: 2208104 ns, 163780 iterations
[ 37.372564] find_first_bit: 17722203 ns, 16370 iterations
[ 37.737415] find_first_and_bit: 358135191 ns, 32453 iterations
[ 37.745420] find_next_and_bit: 1280537 ns, 73644 iterations
[ 37.752143]
Start testing find_bit() with sparse bitmap
[ 37.759032] find_next_bit: 41256 ns, 655 iterations
[ 37.769905] find_next_zero_bit: 4148410 ns, 327026 iterations
[ 37.776675] find_last_bit: 48742 ns, 655 iterations
[ 37.790961] find_first_bit: 7562371 ns, 655 iterations
[ 37.797743] find_first_and_bit: 47366 ns, 1 iterations
[ 37.804527] find_next_and_bit: 59924 ns, 1 iterations
which is generally faster than the generic version, with the exception
of the sparse find_first_bit (generic was:
[ 25.657304] find_first_bit: 7328573 ns, 656 iterations)
find_next_{,zero_}bit() in the sparse case are quite a bit faster than
the generic code.
Look at find_{first,next}_and_bit results. Those two have no arch version
and in both cases use generic code. In theory they should be equally fast
before and after, but your testing says that generic case is slower even
for them, and the difference is comparable with real arch functions numbers.
It makes me feel like:
- there's something unrelated, like governor/throttling that affect results;
- the numbers are identical, taking the dispersion into account.
If the difference really concerns you, I'd suggest running the test
several times
to measure confidence intervals.
Given that the benchmark is run against random bitmaps and with
interrupts enabled, there is going to be noise in the results.
Here's the second run:
[26234.429389]
Start testing find_bit() with random-filled bitmap
[26234.439722] find_next_bit: 2206687 ns, 164277 iterations
[26234.448664] find_next_zero_bit: 2188368 ns, 163404 iterations
[26234.457612] find_last_bit: 2223742 ns, 164278 iterations
[26234.482056] find_first_bit: 17720726 ns, 16384 iterations
[26234.859374] find_first_and_bit: 370602019 ns, 32877 iterations
[26234.867379] find_next_and_bit: 1280651 ns, 74091 iterations
[26234.874107]
Start testing find_bit() with sparse bitmap
[26234.881014] find_next_bit: 46142 ns, 656 iterations
[26234.891900] find_next_zero_bit: 4158987 ns, 327025 iterations
[26234.898672] find_last_bit: 49727 ns, 656 iterations
[26234.912504] find_first_bit: 7107862 ns, 656 iterations
[26234.919290] find_first_and_bit: 52092 ns, 1 iterations
[26234.926076] find_next_and_bit: 60856 ns, 1 iterations
And a third run:
[26459.679524]
Start testing find_bit() with random-filled bitmap
[26459.689871] find_next_bit: 2199418 ns, 163311 iterations
[26459.698798] find_next_zero_bit: 2181289 ns, 164370 iterations
[26459.707738] find_last_bit: 2213638 ns, 163311 iterations
[26459.732224] find_first_bit: 17764152 ns, 16429 iterations
[26460.133823] find_first_and_bit: 394886375 ns, 32672 iterations
[26460.141818] find_next_and_bit: 1269693 ns, 73485 iterations
[26460.148545]
Start testing find_bit() with sparse bitmap
[26460.155433] find_next_bit: 40753 ns, 653 iterations
[26460.166307] find_next_zero_bit: 4148211 ns, 327028 iterations
[26460.173078] find_last_bit: 50017 ns, 653 iterations
[26460.187007] find_first_bit: 7205325 ns, 653 iterations
[26460.193790] find_first_and_bit: 49358 ns, 1 iterations
[26460.200577] find_next_and_bit: 62332 ns, 1 iterations
My gut feeling is that yes, there is some variance, but not on an
order that is significant that would allow us to say "there's no
difference".
find_next_bit results for random are: 2222703, 2206687, 2199418,
which is an average of 2209603 and a variance of around 0.5%.
The difference between this and the single generic figure I have
is on the order of 20%.
I'll do the same with find_first_bit for random: 17722203, 17720726,
and 17764152. Average is 17735694. Variance is around 0.1% or 0.2%.
The difference between this and the single generic figure I have is
on the order of 5%. Not so large, but still quite a big difference
compared to the variance.
find_first_bit for sparse: 7562371, 7107862, 7205325. Average is
7291853. Variance is higher at about 4%. Difference between this and
the generic figure is 0.5%, so this one is not significantly
different.
The best result looks to be find_next_zero_bit for the sparse bitmap
case. The generic code measures 5.5ms, the native code is sitting
around 4.1ms. That's a difference of around 34%, and by just looking
at the range in the figures above we can see this is a significant
result without needing to do the calculations. Similar is true of
find_next_bit for the sparse bitmap.
So, I think the results are significant in most cases and variance
doesn't account for the differences. The only one which isn't is
find_first_bit for the sparse case.
Hi Russel,
+ Alexey Klimov <klimov.linux@xxxxxxxxx>
This is my testings for native vs generic find_bit operations on a15
and 17.
The raw numbers are collected by Alexey Klimov on Odroid-xu3. All cpu
frequencies were fixed at 1000Mhz. (Thanks a lot!)
For each native/generic @ a15/a7 configuration, the find_bit_benchmark
was run 5 times, and the results are summarized below:
A15 Native Generic Difference
Dense ns ns % sigmas
find_next_bit: 3746929 3231641 14 8.3
find_next_zero_bit: 3935354 3202608 19 10.4
find_last_bit: 3134713 3129717 0 0.1
find_first_bit: 85626542 20498669 76 172.4
find_first_and_bit: 409252997 414820417 -1 -0.2
find_next_and_bit: 1678706 1654420 1 0.4
Sparse
find_next_bit: 143208 77924 46 29.4
find_next_zero_bit: 6893375 6059177 12 14.3
find_last_bit: 67174 68616 -2 -1.0
find_first_bit: 33689256 8151493 76 47.8
find_first_and_bit: 124758 156974 -26 -1.3
find_next_and_bit: 53391 56716 -6 -0.2
A7 Native Generic Difference
Dense ns ns % sigmas
find_next_bit: 4207627 5532764 -31 -14.9
find_next_zero_bit: 4259961 5236880 -23 -10.0
find_last_bit: 4281386 4201025 2 1.5
find_first_bit: 236913620 50970424 78 163.3
find_first_and_bit: 728069762 750580781 -3 -0.7
find_next_and_bit: 2696263 2766077 -3 -0.9
Sparse
find_next_bit: 327241 143776 56 40.7
find_next_zero_bit: 6987249 10235989 -46 -21.8
find_last_bit: 97758 94725 3 0.6
find_first_bit: 94628040 21051964 78 41.8
find_first_and_bit: 248133 241267 3 0.3
find_next_and_bit: 136475 154000 -13 -0.5
The last column is the difference between native and generic code
performance normalized to a standard deviation:
(mean(native) - mean(generic)) / max(std(native), std(generic))
The results look consistent to me because 'and' subtests that are always
generic differ by less than one sigma.
On A15 generic code is a clear winner. On A7 results are inconsistent
although significant. Maybe it's worth to retest on A7.
Regarding the choice between native and generic core - I would prefer
generic version even if it's slightly slower because it is tested and
maintained better. And because the results of the test are at least on
par, to me it's a no-brainer.
Would be really interesting to compare performance of your LDRB->LDR
patch with the generic code using the same procedure.
Thanks,
Yury