Re: Linux 5.19-rc8

Oh FFS.

I see you decided off your own bat to remove the ARM version of the
find_bit functions, with NO agreement from the arch maintainer. This
is not on.


On Sat, Jul 30, 2022 at 02:38:38PM -0700, Yury Norov wrote:
On Wed, Jul 27, 2022 at 08:43:22AM +0100, Russell King (Oracle) wrote:
On Tue, Jul 26, 2022 at 06:33:55PM -0700, Yury Norov wrote:
On Tue, Jul 26, 2022 at 5:15 PM Russell King (Oracle)
<linux@xxxxxxxxxxxxxxx> wrote:

On Tue, Jul 26, 2022 at 01:20:23PM -0700, Linus Torvalds wrote:
On Tue, Jul 26, 2022 at 12:44 PM Russell King (Oracle)
<linux@xxxxxxxxxxxxxxx> wrote:

Overall, I would say it's pretty similar (some generic functions
perform marginally better, some native ones perform marginally better)
with the exception of find_first_bit() being much better with the
generic implementation, but find_next_zero_bit() being noticeably worse.

The generic _find_first_bit() code is actually sane and simple. It
loops over words until it finds a non-zero one, and then does trivial
calculations on that last word.

That explains why the generic code does so much better than your byte-wise asm.
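
For reference, that word-at-a-time shape is roughly the following (a
minimal userspace sketch, not the actual lib/find_bit.c code, with
__builtin_ctzl() standing in for the kernel's __ffs()):

#include <stddef.h>

#define BITS_PER_LONG   (8 * sizeof(unsigned long))

/*
 * Scan whole words, skipping zero ones, then resolve the first set
 * bit inside the first non-zero word with a count-trailing-zeros.
 */
static size_t find_first_bit_sketch(const unsigned long *addr, size_t nbits)
{
        size_t idx;

        for (idx = 0; idx * BITS_PER_LONG < nbits; idx++) {
                if (addr[idx]) {
                        size_t bit = idx * BITS_PER_LONG +
                                     (size_t)__builtin_ctzl(addr[idx]);
                        return bit < nbits ? bit : nbits;
                }
        }

        return nbits;   /* no set bit found */
}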

In contrast, the generic _find_next_bit() I find almost offensively
silly - which in turn explains why your byte-wide asm does better.

I think the generic _find_next_bit() should actually do what the m68k
find_next_bit code does: handle the first special word itself, and
then just call find_first_bit() on the rest of it.

And it should *not* try to handle the dynamic "bswap and/or bit sense
invert" thing at all. That should be just four different (trivial)
cases for the first word.
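
A minimal sketch of that shape (illustrative only, reusing
find_first_bit_sketch() from above; the swab/invert variants would
differ only in how the first word, and the words loaded afterwards,
are massaged):

/*
 * Handle the first, possibly partial word explicitly, then hand the
 * remaining word-aligned tail to the plain word scan above.
 */
static size_t find_next_bit_sketch(const unsigned long *addr, size_t nbits,
                                   size_t start)
{
        unsigned long first;
        size_t idx;

        if (start >= nbits)
                return nbits;

        /* Mask off the bits below 'start' in the first word. */
        idx = start / BITS_PER_LONG;
        first = addr[idx] & (~0UL << (start % BITS_PER_LONG));
        if (first) {
                size_t bit = idx * BITS_PER_LONG +
                             (size_t)__builtin_ctzl(first);
                return bit < nbits ? bit : nbits;
        }

        /* The rest of the bitmap is word-aligned: reuse the plain scan. */
        start = (idx + 1) * BITS_PER_LONG;
        if (start >= nbits)
                return nbits;

        return start + find_first_bit_sketch(addr + idx + 1, nbits - start);
}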

Here are the results for the native version converted to use word loads:

[   37.319937]
               Start testing find_bit() with random-filled bitmap
[   37.330289] find_next_bit:                 2222703 ns, 163781 iterations
[   37.339186] find_next_zero_bit:            2154375 ns, 163900 iterations
[   37.348118] find_last_bit:                 2208104 ns, 163780 iterations
[   37.372564] find_first_bit:               17722203 ns,  16370 iterations
[   37.737415] find_first_and_bit:          358135191 ns,  32453 iterations
[   37.745420] find_next_and_bit:             1280537 ns,  73644 iterations
[   37.752143]
               Start testing find_bit() with sparse bitmap
[   37.759032] find_next_bit:                   41256 ns,    655 iterations
[   37.769905] find_next_zero_bit:            4148410 ns, 327026 iterations
[   37.776675] find_last_bit:                   48742 ns,    655 iterations
[   37.790961] find_first_bit:                7562371 ns,    655 iterations
[   37.797743] find_first_and_bit:              47366 ns,      1 iterations
[   37.804527] find_next_and_bit:               59924 ns,      1 iterations

which is generally faster than the generic version, with the exception
of the sparse find_first_bit (generic was:
[   25.657304] find_first_bit:                7328573 ns,    656 iterations)

find_next_{,zero_}bit() in the sparse case are quite a bit faster than
the generic code.

Look at the find_{first,next}_and_bit results. Those two have no arch version
and in both cases use the generic code. In theory they should be equally fast
before and after, but your testing says the generic case is slower even
for them, and the difference is comparable to that of the real arch functions.
It makes me feel like either:
 - there's something unrelated, like a governor or throttling, affecting the results; or
 - the numbers are actually identical once the dispersion is taken into account.

If the difference really concerns you, I'd suggest running the test
several times to measure confidence intervals.

Given that the benchmark is run against random bitmaps and with
interrupts enabled, there is going to be noise in the results.

Here's the second run:

[26234.429389]
               Start testing find_bit() with random-filled bitmap
[26234.439722] find_next_bit:                 2206687 ns, 164277 iterations
[26234.448664] find_next_zero_bit:            2188368 ns, 163404 iterations
[26234.457612] find_last_bit:                 2223742 ns, 164278 iterations
[26234.482056] find_first_bit:               17720726 ns,  16384 iterations
[26234.859374] find_first_and_bit:          370602019 ns,  32877 iterations
[26234.867379] find_next_and_bit:             1280651 ns,  74091 iterations
[26234.874107]
               Start testing find_bit() with sparse bitmap
[26234.881014] find_next_bit:                   46142 ns,    656 iterations
[26234.891900] find_next_zero_bit:            4158987 ns, 327025 iterations
[26234.898672] find_last_bit:                   49727 ns,    656 iterations
[26234.912504] find_first_bit:                7107862 ns,    656 iterations
[26234.919290] find_first_and_bit:              52092 ns,      1 iterations
[26234.926076] find_next_and_bit:               60856 ns,      1 iterations

And a third run:

[26459.679524]
               Start testing find_bit() with random-filled bitmap
[26459.689871] find_next_bit:                 2199418 ns, 163311 iterations
[26459.698798] find_next_zero_bit:            2181289 ns, 164370 iterations
[26459.707738] find_last_bit:                 2213638 ns, 163311 iterations
[26459.732224] find_first_bit:               17764152 ns,  16429 iterations
[26460.133823] find_first_and_bit:          394886375 ns,  32672 iterations
[26460.141818] find_next_and_bit:             1269693 ns,  73485 iterations
[26460.148545]
               Start testing find_bit() with sparse bitmap
[26460.155433] find_next_bit:                   40753 ns,    653 iterations
[26460.166307] find_next_zero_bit:            4148211 ns, 327028 iterations
[26460.173078] find_last_bit:                   50017 ns,    653 iterations
[26460.187007] find_first_bit:                7205325 ns,    653 iterations
[26460.193790] find_first_and_bit:              49358 ns,      1 iterations
[26460.200577] find_next_and_bit:               62332 ns,      1 iterations

My gut feeling is that yes, there is some variance, but not of an
order significant enough to allow us to say "there's no
difference".

find_next_bit results for random are: 2222703, 2206687, 2199418,
which is an average of 2209603 with a spread of around 0.5%.
The difference between this and the single generic figure I have
is on the order of 20%.

I'll do the same with find_first_bit for random: 17722203, 17720726,
and 17764152. The average is 17735694 and the spread around 0.1% to
0.2%. The difference between this and the single generic figure I have
is on the order of 5%. Not so large, but still quite a big difference
compared to the spread.

find_first_bit for sparse: 7562371, 7107862, 7205325. The average is
7291853 and the spread higher, at about 4%. The difference between this
and the generic figure is 0.5%, so this one is not significantly
different.
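
(As a rough cross-check, the spreads above can be recomputed with
something like the standalone snippet below; the exact percentage
depends on whether the population or sample standard deviation is
used, but the ballpark is the same.)

#include <math.h>
#include <stdio.h>

/* Mean and relative standard deviation (percent of the mean) of a run set. */
static void spread(const char *name, const double *ns, int n)
{
        double mean = 0.0, var = 0.0;
        int i;

        for (i = 0; i < n; i++)
                mean += ns[i];
        mean /= n;

        for (i = 0; i < n; i++)
                var += (ns[i] - mean) * (ns[i] - mean);
        var /= n - 1;                   /* sample variance, needs n >= 2 */

        printf("%-26s mean %.0f ns, spread %.1f%%\n",
               name, mean, 100.0 * sqrt(var) / mean);
}

int main(void)
{
        const double next_random[]  = { 2222703, 2206687, 2199418 };
        const double first_random[] = { 17722203, 17720726, 17764152 };
        const double first_sparse[] = { 7562371, 7107862, 7205325 };

        spread("find_next_bit (random)", next_random, 3);
        spread("find_first_bit (random)", first_random, 3);
        spread("find_first_bit (sparse)", first_sparse, 3);

        return 0;
}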

The best result looks to be find_next_zero_bit for the sparse bitmap
case. The generic code measures 5.5ms, the native code is sitting
around 4.1ms. That's a difference of around 34%, and by just looking
at the range in the figures above we can see this is a significant
result without needing to do the calculations. Similar is true of
find_next_bit for the sparse bitmap.

So, I think the results are significant in most cases, and the spread
doesn't account for the differences. The only one which isn't is
find_first_bit for the sparse case.

Hi Russell,

+ Alexey Klimov <klimov.linux@xxxxxxxxx>

Here is my testing of native vs generic find_bit operations on A15
and A7.

The raw numbers were collected by Alexey Klimov on an Odroid-XU3. All
CPU frequencies were fixed at 1000 MHz. (Thanks a lot!)

For each native/generic @ A15/A7 configuration, find_bit_benchmark
was run 5 times, and the results are summarized below:

A15                      Native     Generic       Difference
Dense                        ns          ns       %   sigmas
find_next_bit:          3746929     3231641      14      8.3
find_next_zero_bit:     3935354     3202608      19     10.4
find_last_bit:          3134713     3129717       0      0.1
find_first_bit:        85626542    20498669      76    172.4
find_first_and_bit:   409252997   414820417      -1     -0.2
find_next_and_bit:      1678706     1654420       1      0.4
                                              
Sparse                                        
find_next_bit:          143208        77924      46     29.4
find_next_zero_bit:    6893375      6059177      12     14.3
find_last_bit:           67174        68616      -2     -1.0
find_first_bit:       33689256      8151493      76     47.8
find_first_and_bit:     124758       156974     -26     -1.3
find_next_and_bit:       53391        56716      -6     -0.2

A7                      Native      Generic       Difference
Dense                       ns           ns       %   sigmas
find_next_bit:         4207627      5532764     -31    -14.9
find_next_zero_bit:    4259961      5236880     -23    -10.0
find_last_bit:         4281386      4201025       2      1.5
find_first_bit:      236913620     50970424      78    163.3
find_first_and_bit:  728069762    750580781      -3     -0.7
find_next_and_bit:     2696263      2766077      -3     -0.9

Sparse
find_next_bit:          327241       143776      56     40.7
find_next_zero_bit:    6987249     10235989     -46    -21.8
find_last_bit:           97758        94725       3      0.6
find_first_bit:       94628040     21051964      78     41.8
find_first_and_bit:     248133       241267       3      0.3
find_next_and_bit:      136475       154000     -13     -0.5

The last column is the difference between native and generic code
performance normalized to a standard deviation:
        (mean(native) - mean(generic)) / max(std(native), std(generic))
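
Expressed as code, the metric is roughly the following (an illustrative
sketch, not the script actually used to produce the tables; it assumes
at least two runs per configuration):

#include <math.h>

/* Mean of n per-run timings. */
static double mean(const double *v, int n)
{
        double m = 0.0;
        int i;

        for (i = 0; i < n; i++)
                m += v[i];
        return m / n;
}

/* Sample standard deviation of n per-run timings (n >= 2). */
static double stddev(const double *v, int n)
{
        double m = mean(v, n), s = 0.0;
        int i;

        for (i = 0; i < n; i++)
                s += (v[i] - m) * (v[i] - m);
        return sqrt(s / (n - 1));
}

/* (mean(native) - mean(generic)) / max(std(native), std(generic)) */
static double sigmas(const double *native, const double *generic, int n)
{
        return (mean(native, n) - mean(generic, n)) /
               fmax(stddev(native, n), stddev(generic, n));
}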

The results look consistent to me because the 'and' subtests, which
always use the generic code, differ by less than one sigma.

On A15 the generic code is a clear winner. On A7 the results are
inconsistent, although significant. Maybe it's worth retesting on A7.
 
Regarding the choice between native and generic code - I would prefer
the generic version even if it were slightly slower, because it is
better tested and maintained. And since the results of the test are at
least on par, to me it's a no-brainer.

It would be really interesting to compare the performance of your
LDRB->LDR patch with the generic code using the same procedure.

Thanks,
Yury


-- 
RMK's Patch system: https://www.armlinux.org.uk/developer/patches/
FTTP is here! 40Mbps down 10Mbps up. Decent connectivity at last!


