Re: + bitops-optimize-fns-for-improved-performance.patch added to mm-nonmm-unstable branch

Yury Norov <yury.norov@xxxxxxxxx> · Mon, 29 Apr 2024 09:34:06 -0700

On Mon, Apr 29, 2024 at 11:40:34PM +0800, Kuan-Wei Chiu wrote:
> On Mon, Apr 29, 2024 at 10:05:12AM +0200, Rasmus Villemoes wrote:
> > On 28/04/2024 18.08, Yury Norov wrote:
> > > + Rasmus
> > > 
> > > On Sat, Apr 27, 2024 at 01:33:55PM +0800, Kuan-Wei Chiu wrote:
> > >> Before:
> > >>                Start testing find_bit() with random-filled bitmap
> > >> [    0.299085] fbcon: Taking over console
> > >> [    0.299820] find_next_bit:                  606286 ns, 164169 iterations
> > >> [    0.300463] find_next_zero_bit:             641072 ns, 163512 iterations
> > >> [    0.300996] find_last_bit:                  531027 ns, 164169 iterations
> > >> [    0.305233] find_nth_bit:                  4235859 ns,  16454 iterations
> > >> [    0.306434] find_first_bit:                1199357 ns,  16455 iterations
> > >> [    0.321616] find_first_and_bit:           15179667 ns,  32869 iterations
> > >> [    0.321917] find_next_and_bit:              298836 ns,  73875 iterations
> > >> [    0.321918] 
> > >>                Start testing find_bit() with sparse bitmap
> > >> [    0.321953] find_next_bit:                    7931 ns,    656 iterations
> > >> [    0.323201] find_next_zero_bit:            1246980 ns, 327025 iterations
> > >> [    0.323210] find_last_bit:                    8000 ns,    656 iterations
> > >> [    0.324427] find_nth_bit:                  1213161 ns,    655 iterations
> > >> [    0.324813] find_first_bit:                 384747 ns,    656 iterations
> > >> [    0.324817] find_first_and_bit:               2220 ns,      1 iterations
> > >> [    0.324820] find_next_and_bit:                1831 ns,      1 iterations
> > >>
> > >> After:
> > >>                Start testing find_bit() with random-filled bitmap
> > >> [    0.305081] fbcon: Taking over console
> > >> [    0.306126] find_next_bit:                  854517 ns, 163960 iterations
> > >> [    0.307041] find_next_zero_bit:             911725 ns, 163721 iterations
> > >> [    0.307711] find_last_bit:                  668261 ns, 163960 iterations
> > >> [    0.311160] find_nth_bit:                  3447530 ns,  16372 iterations
> > >> [    0.312358] find_first_bit:                1196633 ns,  16373 iterations
> > >> [    0.327191] find_first_and_bit:           14830129 ns,  32951 iterations
> > >> [    0.327503] find_next_and_bit:              310560 ns,  73719 iterations
> > >> [    0.327504] 
> > >>                Start testing find_bit() with sparse bitmap
> > >> [    0.327539] find_next_bit:                    7633 ns,    656 iterations
> > >> [    0.328787] find_next_zero_bit:            1247398 ns, 327025 iterations
> > >> [    0.328797] find_last_bit:                    8425 ns,    656 iterations
> > >> [    0.330034] find_nth_bit:                  1234044 ns,    655 iterations
> > >> [    0.330428] find_first_bit:                 392086 ns,    656 iterations
> > >> [    0.330431] find_first_and_bit:               1980 ns,      1 iterations
> > >> [    0.330434] find_next_and_bit:                1831 ns,      1 iterations
> > >>
> > >> Some benchmarks seem to have worsened after applying this patch.
> > >> However, unless I'm mistaken, the fns() changes should only affect the
> > >> results of find_nth_bit, while the others are just random fluctuations.
> > > 
> > > So...
> > > 
> > > The patch itself looks good, 
> > 
> > Well, I think it could be even better. While I agree that bit=ffs();
> > clear_bit(bit, ) is probably a bad way of doing it, I think the basic
> > structure of the function is good. Introducing a "count from 0 up to n"
> > loop is rarely a good thing, keeping the n counting down to 0 is likely
> > better.
> > 
> > So I'd instead just change the function to
> > 
> > static inline unsigned long fns(unsigned long word, unsigned int n)
> > {
> > 	while (word) {
> > 		if (n-- == 0)
> > 			return __ffs(word);
> > 		word &= word - 1;
> > 	}
> > 
> > 	return BITS_PER_LONG;
> > }
> >
> How about rewriting it as follows:
> 
> static inline unsigned long fns(unsigned long word, unsigned int n)
> {
>     while (word && n--)
>         word &= word - 1;
> 
>     return word ? __ffs(word) : BITS_PER_LONG;
> }
> 
> IMHO, this way the code can be shorter and more elegant.

Rasmus' code looks shorter because it tests the 'word != 0' only once
when n == 0, for example. But we never sure how the compiler would
optimize it. Can you show a disassembly of both and pick the best?

Thanks,
Yury