Re: + bitops-optimize-fns-for-improved-performance.patch added to mm-nonmm-unstable branch

Kuan-Wei Chiu <visitorckw@xxxxxxxxx> · Sat, 27 Apr 2024 13:33:55 +0800

On Fri, Apr 26, 2024 at 12:48:48PM -0700, Yury Norov wrote:
> On Fri, Apr 26, 2024 at 12:08 PM Andrew Morton
> <akpm@xxxxxxxxxxxxxxxxxxxx> wrote:
> >
> >
> > The patch titled
> >      Subject: bitops: optimize fns() for improved performance
> > has been added to the -mm mm-nonmm-unstable branch.  Its filename is
> >      bitops-optimize-fns-for-improved-performance.patch
> >
> > This patch will shortly appear at
> >      https://git.kernel.org/pub/scm/linux/kernel/git/akpm/25-new.git/tree/patches/bitops-optimize-fns-for-improved-performance.patch
> >
> > This patch will later appear in the mm-nonmm-unstable branch at
> >     git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
> >
> > Before you just go and hit "reply", please:
> >    a) Consider who else should be cc'ed
> >    b) Prefer to cc a suitable mailing list as well
> >    c) Ideally: find the original patch on the mailing list and do a
> >       reply-to-all to that, adding suitable additional cc's
> >
> > *** Remember to use Documentation/process/submit-checklist.rst when testing your code ***
> >
> > The -mm tree is included into linux-next via the mm-everything
> > branch at git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
> > and is updated there every 2-3 working days
> >
> > ------------------------------------------------------
> > From: Kuan-Wei Chiu <visitorckw@xxxxxxxxx>
> > Subject: bitops: optimize fns() for improved performance
> > Date: Fri, 26 Apr 2024 11:51:52 +0800
> >
> > The current fns() repeatedly uses __ffs() to find the index of the least
> > significant bit and then clears the corresponding bit using __clear_bit().
> > The method for clearing the least significant bit can be optimized by
> > using word &= word - 1 instead.
> >
> > Typically, the execution time of one __ffs() plus one __clear_bit() is
> > longer than that of a bitwise AND operation and a subtraction.  To improve
> > performance, the loop for clearing the least significant bit has been
> > replaced with word &= word - 1, followed by a single __ffs() operation to
> > obtain the answer.  This change reduces the number of __ffs() iterations
> > from n to just one, enhancing overall performance.
> >
> > The following microbenchmark data, conducted on my x86-64 machine, shows
> > the execution time (in microseconds) required for 1000000 test data
> > generated by get_random_u64() and executed by fns() under different values
> > of n:
> >
> > +-----+---------------+---------------+
> > |  n  |   time_old    |   time_new    |
> > +-----+---------------+---------------+
> > |  0  |     29194     |     25878     |
> > |  1  |     25510     |     25497     |
> > |  2  |     27836     |     25721     |
> > |  3  |     30140     |     25673     |
> > |  4  |     32569     |     25426     |
> > |  5  |     34792     |     25690     |
> > |  6  |     37117     |     25651     |
> > |  7  |     39742     |     25383     |
> > |  8  |     42360     |     25657     |
> > |  9  |     44672     |     25897     |
> > | 10  |     47237     |     25819     |
> > | 11  |     49884     |     26530     |
> > | 12  |     51864     |     26647     |
> > | 13  |     54265     |     28915     |
> > | 14  |     56440     |     28373     |
> > | 15  |     58839     |     28616     |
> > | 16  |     62383     |     29128     |
> > | 17  |     64257     |     30041     |
> > | 18  |     66805     |     29773     |
> > | 19  |     69368     |     33203     |
> > | 20  |     72942     |     33688     |
> > | 21  |     77006     |     34518     |
> > | 22  |     80926     |     34298     |
> > | 23  |     85723     |     35586     |
> > | 24  |     90324     |     36376     |
> > | 25  |     95992     |     37465     |
> > | 26  |    101101     |     37599     |
> > | 27  |    106520     |     37466     |
> > | 28  |    113287     |     38163     |
> > | 29  |    120552     |     38810     |
> > | 30  |    128040     |     39373     |
> > | 31  |    135624     |     40500     |
> > | 32  |    142580     |     40343     |
> > | 33  |    148915     |     40460     |
> > | 34  |    154005     |     41294     |
> > | 35  |    157996     |     41730     |
> > | 36  |    160806     |     41523     |
> > | 37  |    162975     |     42088     |
> > | 38  |    163426     |     41530     |
> > | 39  |    164872     |     41789     |
> > | 40  |    164477     |     42505     |
> > | 41  |    164758     |     41879     |
> > | 42  |    164182     |     41415     |
> > | 43  |    164842     |     42119     |
> > | 44  |    164881     |     42297     |
> > | 45  |    164870     |     42145     |
> > | 46  |    164673     |     42066     |
> > | 47  |    164616     |     42051     |
> > | 48  |    165055     |     41902     |
> > | 49  |    164847     |     41862     |
> > | 50  |    165171     |     41960     |
> > | 51  |    164851     |     42089     |
> > | 52  |    164763     |     41717     |
> > | 53  |    164635     |     42154     |
> > | 54  |    164757     |     41983     |
> > | 55  |    165095     |     41419     |
> > | 56  |    164641     |     42381     |
> > | 57  |    164601     |     41654     |
> > | 58  |    164864     |     41834     |
> > | 59  |    164594     |     41920     |
> > | 60  |    165207     |     42020     |
> > | 61  |    165056     |     41185     |
> > | 62  |    165160     |     41722     |
> > | 63  |    164923     |     41702     |
> > | 64  |    164777     |     41880     |
> > +-----+---------------+---------------+
> 
> Hi Kuan-Wei,
> 
> I didn't receive the original email for some reason...
> We've got a performance test for the function in find_bit_benchmark.
> Can you print before/after here?
> 
> Thanks,
> Yury
>

Hi Yury,

Here are the find_bit_benchmark results before and after applying the patch:

Before:
               Start testing find_bit() with random-filled bitmap
[    0.299085] fbcon: Taking over console
[    0.299820] find_next_bit:                  606286 ns, 164169 iterations
[    0.300463] find_next_zero_bit:             641072 ns, 163512 iterations
[    0.300996] find_last_bit:                  531027 ns, 164169 iterations
[    0.305233] find_nth_bit:                  4235859 ns,  16454 iterations
[    0.306434] find_first_bit:                1199357 ns,  16455 iterations
[    0.321616] find_first_and_bit:           15179667 ns,  32869 iterations
[    0.321917] find_next_and_bit:              298836 ns,  73875 iterations
[    0.321918] 
               Start testing find_bit() with sparse bitmap
[    0.321953] find_next_bit:                    7931 ns,    656 iterations
[    0.323201] find_next_zero_bit:            1246980 ns, 327025 iterations
[    0.323210] find_last_bit:                    8000 ns,    656 iterations
[    0.324427] find_nth_bit:                  1213161 ns,    655 iterations
[    0.324813] find_first_bit:                 384747 ns,    656 iterations
[    0.324817] find_first_and_bit:               2220 ns,      1 iterations
[    0.324820] find_next_and_bit:                1831 ns,      1 iterations

After:
               Start testing find_bit() with random-filled bitmap
[    0.305081] fbcon: Taking over console
[    0.306126] find_next_bit:                  854517 ns, 163960 iterations
[    0.307041] find_next_zero_bit:             911725 ns, 163721 iterations
[    0.307711] find_last_bit:                  668261 ns, 163960 iterations
[    0.311160] find_nth_bit:                  3447530 ns,  16372 iterations
[    0.312358] find_first_bit:                1196633 ns,  16373 iterations
[    0.327191] find_first_and_bit:           14830129 ns,  32951 iterations
[    0.327503] find_next_and_bit:              310560 ns,  73719 iterations
[    0.327504] 
               Start testing find_bit() with sparse bitmap
[    0.327539] find_next_bit:                    7633 ns,    656 iterations
[    0.328787] find_next_zero_bit:            1247398 ns, 327025 iterations
[    0.328797] find_last_bit:                    8425 ns,    656 iterations
[    0.330034] find_nth_bit:                  1234044 ns,    655 iterations
[    0.330428] find_first_bit:                 392086 ns,    656 iterations
[    0.330431] find_first_and_bit:               1980 ns,      1 iterations
[    0.330434] find_next_and_bit:                1831 ns,      1 iterations

Some of the other benchmarks look slower after applying this patch.
However, unless I'm mistaken, the fns() change should only affect the
find_nth_bit results; the differences in the other tests are most
likely run-to-run fluctuations. Should I include the above benchmark
data in the commit message and send a v2 patch?
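
To compare the find_nth_bit runs on equal footing, the totals can be
divided by the iteration counts, since the two boots did not run exactly
the same number of iterations. The snippet below is only a userspace
sketch: struct sample and ns_per_iter() are names made up for this mail,
not part of find_bit_benchmark, and the numbers are copied from the logs
above.

```c
#include <assert.h>

/* One line of find_bit_benchmark output (hypothetical helper type). */
struct sample {
	unsigned long long ns;
	unsigned long iterations;
};

/* Average time per iteration in nanoseconds. */
static double ns_per_iter(struct sample s)
{
	assert(s.iterations != 0);
	return (double)s.ns / (double)s.iterations;
}

/* find_nth_bit figures copied from the logs above. */
static const struct sample nth_random_before = { 4235859ULL, 16454UL };
static const struct sample nth_random_after  = { 3447530ULL, 16372UL };
static const struct sample nth_sparse_before = { 1213161ULL, 655UL };
static const struct sample nth_sparse_after  = { 1234044ULL, 655UL };
```

With these inputs the random-filled case comes out at roughly 257 ns per
iteration before and 211 ns after. The sparse case moves from roughly
1852 ns to 1884 ns, a difference of under 2% that may well be noise.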

Also, I'm sorry that you didn't receive the original email. I got the
following "Message not delivered" bounce, but I'm not sure whether it
is related or what caused the error:

Date: Sat, 27 Apr 2024 04:29:04 +0000 (UTC)
From: do-not-reply@xxxxxxxxxxxxxxx
To: visitorckw@xxxxxxxxx
Subject: Undelivered Mail

This is an automated message from mail service of vivek.yagnik@xxxxxxxxxxxxxxx

⚠ Message not delivered
------------------ Message details ------------------
From: visitorckw@xxxxxxxxx
To: vivek.yagnik@xxxxxxxxxxxxxxx
Sent: 2024-04-27T04:29:03.000Z
Subject: [PATCH] bitops: Optimize fns() for improved performance
Failure reason: <vivek.yagnik@xxxxxxxxxxxxxxx>: host sophosemail-com.mail.protection.outlook.com[52.101.144.3]
said: 451 4.4.4 Mail received as unauthenticated, incoming to a recipient domain configured in a hosted tenant
which has no mail-enabled subscriptions. ATTR5 [MA1PEPF000072B2.INDPRD01.PROD.OUTLOOK.COM 2024-04-27T04:29:03.836Z
08DC631634A0BBEB] (in reply to end of DATA command)

Regards,
Kuan-Wei

> > Link: https://lkml.kernel.org/r/20240426035152.956702-1-visitorckw@xxxxxxxxx
> > Signed-off-by: Kuan-Wei Chiu <visitorckw@xxxxxxxxx>
> > Cc: Ching-Chun (Jim) Huang <jserv@xxxxxxxxxxxxxxxx>
> > Cc: Yury Norov <yury.norov@xxxxxxxxx>
> > Signed-off-by: Andrew Morton <akpm@xxxxxxxxxxxxxxxxxxxx>
> > ---
> >
> >  include/linux/bitops.h |   12 ++++--------
> >  1 file changed, 4 insertions(+), 8 deletions(-)
> >
> > --- a/include/linux/bitops.h~bitops-optimize-fns-for-improved-performance
> > +++ a/include/linux/bitops.h
> > @@ -254,16 +254,12 @@ static inline unsigned long __ffs64(u64
> >   */
> >  static inline unsigned long fns(unsigned long word, unsigned int n)
> >  {
> > -       unsigned int bit;
> > +       unsigned int i;
> >
> > -       while (word) {
> > -               bit = __ffs(word);
> > -               if (n-- == 0)
> > -                       return bit;
> > -               __clear_bit(bit, &word);
> > -       }
> > +       for (i = 0; word && i < n; i++)
> > +               word &= word - 1;
> >
> > -       return BITS_PER_LONG;
> > +       return word ? __ffs(word) : BITS_PER_LONG;
> >  }
> >
> >  /**
> > _
> >
> > Patches currently in -mm which might be from visitorckw@xxxxxxxxx are
> >
> > bitops-optimize-fns-for-improved-performance.patch
> >
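
For anyone who wants to check the change outside the kernel, here is a
minimal userspace sketch of the old and new fns() from the diff above.
It assumes GCC or Clang, uses __builtin_ctzl() as a stand-in for __ffs(),
and defines BITS_PER_LONG locally. ffs_ul(), fns_old(), fns_new() and
fns_mismatches() are names made up for this example, not kernel symbols.

```c
#include <assert.h>
#include <limits.h>

/* Local stand-in for the kernel's BITS_PER_LONG (assumption). */
#define BITS_PER_LONG ((unsigned int)(sizeof(unsigned long) * CHAR_BIT))

/* Stand-in for __ffs(): index of the lowest set bit, word must be non-zero. */
static inline unsigned long ffs_ul(unsigned long word)
{
	assert(word != 0);
	return (unsigned long)__builtin_ctzl(word);
}

/* Old approach: one __ffs() and one bit clear per loop iteration. */
static unsigned long fns_old(unsigned long word, unsigned int n)
{
	while (word) {
		unsigned long bit = ffs_ul(word);

		if (n-- == 0)
			return bit;
		word &= ~(1UL << bit);	/* same effect as __clear_bit(bit, &word) */
	}
	return BITS_PER_LONG;
}

/* New approach: drop the lowest set bit n times, then a single __ffs(). */
static unsigned long fns_new(unsigned long word, unsigned int n)
{
	unsigned int i;

	for (i = 0; word && i < n; i++)
		word &= word - 1;	/* clears the least significant set bit */

	return word ? ffs_ul(word) : BITS_PER_LONG;
}

/* Count disagreements over a few sample words and every n in 0..BITS_PER_LONG. */
static unsigned int fns_mismatches(void)
{
	static const unsigned long words[] = {
		0UL, 1UL, 0x10UL, 0xF0UL, 0xA5A5A5A5UL, ~0UL
	};
	unsigned int bad = 0;
	unsigned int w, n;

	for (w = 0; w < sizeof(words) / sizeof(words[0]); w++)
		for (n = 0; n <= BITS_PER_LONG; n++)
			if (fns_old(words[w], n) != fns_new(words[w], n))
				bad++;
	return bad;
}
```

fns_mismatches() is expected to return 0. For word = 0xF0, both versions
give 4 for n = 0, 7 for n = 3, and BITS_PER_LONG for n = 4.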