On Sat, Jun 19, 2021 at 10:28:07AM -0700, Yury Norov wrote:
> On Sat, Jun 19, 2021 at 05:24:15PM +0100, Marc Zyngier wrote:
> > On Fri, 18 Jun 2021 20:57:34 +0100,
> > Yury Norov <yury.norov@xxxxxxxxx> wrote:
> > >
> > > The macros iterate thru all set/clear bits in a bitmap. They search a
> > > first bit using find_first_bit(), and the rest bits using find_next_bit().
> > >
> > > Since find_next_bit() is called shortly after find_first_bit(), we can
> > > save few lines of I-cache by not using find_first_bit().
> >
> > Really?
> >
> > >
> > > Signed-off-by: Yury Norov <yury.norov@xxxxxxxxx>
> > > ---
> > >  include/linux/find.h | 4 ++--
> > >  1 file changed, 2 insertions(+), 2 deletions(-)
> > >
> > > diff --git a/include/linux/find.h b/include/linux/find.h
> > > index 4500e8ab93e2..ae9ed52b52b8 100644
> > > --- a/include/linux/find.h
> > > +++ b/include/linux/find.h
> > > @@ -280,7 +280,7 @@ unsigned long find_next_bit_le(const void *addr, unsigned
> > >  #endif
> > >
> > >  #define for_each_set_bit(bit, addr, size) \
> > > -	for ((bit) = find_first_bit((addr), (size));		\
> > > +	for ((bit) = find_next_bit((addr), (size), 0);		\
> >
> > On which architecture do you observe a gain? Only 32bit ARM and m68k
> > implement their own version of find_first_bit(), and everyone else
> > uses the canonical implementation:
>
> And those who enable GENERIC_FIND_FIRST_BIT - x86, arm64, arc, mips
> and s390.
>
> > #ifndef find_first_bit
> > #define find_first_bit(addr, size) find_next_bit((addr), (size), 0)
> > #endif
> >
> > These architectures explicitly have different implementations for
> > find_first_bit() and find_next_bit() because they can do better
> > (whether that is true or not is another debate). I don't think you
> > should remove this optimisation until it has been measured on these
> > two architectures.
>
> This patch is based on a series that enables separate implementation
> of find_first_bit() for all architectures; according to my tests,
> find_first* is ~ twice faster than find_next* on arm64 and x86.
>
> https://lore.kernel.org/lkml/20210612123639.329047-1-yury.norov@xxxxxxxxx/T/#t
>
> After applying the series, I noticed that my small kernel module that
> calls for_each_set_bit() is now using find_first_bit() to just find
> one bit, and find_next_bit() for all others. I think it's better to
> always use find_next_bit() in this case to minimize the chance of
> cache miss. But if it's not that obvious, I'll try to write some test.

This test measures the difference between for_each_set_bit() and
for_each_set_bit_from().

diff --git a/lib/find_bit_benchmark.c b/lib/find_bit_benchmark.c
index 5637c5711db9..1f37e99090b0 100644
--- a/lib/find_bit_benchmark.c
+++ b/lib/find_bit_benchmark.c
@@ -111,6 +111,59 @@ static int __init test_find_next_and_bit(const void *bitmap,
 	return 0;
 }
 
+#ifdef CONFIG_X86_64
+#define flush_cache_all() wbinvd()
+#endif
+
+static int __init test_for_each_set_bit(int flags)
+{
+#ifdef flush_cache_all
+	DECLARE_BITMAP(bm, BITS_PER_LONG * 2);
+	unsigned long i, cnt = 0;
+	ktime_t time;
+
+	bm[0] = 1; bm[1] = 0;
+
+	time = ktime_get();
+	while (cnt < 1000) {
+		if (flags)
+			flush_cache_all();
+		for_each_set_bit(i, bm, BITS_PER_LONG * 2)
+			cnt++;
+	}
+
+	time = ktime_get() - time;
+
+	pr_err("for_each_set_bit: %18llu ns, %6ld iterations\n", time, cnt);
+#endif
+	return 0;
+}
+
+static int __init test_for_each_set_bit_from(int flags)
+{
+#ifdef flush_cache_all
+	DECLARE_BITMAP(bm, BITS_PER_LONG * 2);
+	unsigned long i, cnt = 0;
+	ktime_t time;
+
+	bm[0] = 1; bm[1] = 0;
+
+	time = ktime_get();
+	while (cnt < 1000) {
+		if (flags)
+			flush_cache_all();
+		i = 0;
+		for_each_set_bit_from(i, bm, BITS_PER_LONG * 2)
+			cnt++;
+	}
+
+	time = ktime_get() - time;
+
+	pr_err("for_each_set_bit_from:%16llu ns, %6ld iterations\n", time, cnt);
+#endif
+	return 0;
+}
+
 static int __init find_bit_test(void)
 {
 	unsigned long nbits = BITMAP_LEN / SPARSE;
@@ -147,6 +200,16 @@ static int __init find_bit_test(void)
 	test_find_first_bit(bitmap, BITMAP_LEN);
 	test_find_next_and_bit(bitmap, bitmap2, BITMAP_LEN);
 
+	pr_err("\nStart testing for_each_bit()\n");
+
+	test_for_each_set_bit(0);
+	test_for_each_set_bit_from(0);
+
+	pr_err("\nStart testing for_each_bit() with cache flushing\n");
+
+	test_for_each_set_bit(1);
+	test_for_each_set_bit_from(1);
+
 	/*
 	 * Everything is OK. Return error just to let user run benchmark
 	 * again without annoying rmmod.

Here on each iteration:
 - for_each_set_bit() calls find_first_bit() once, and find_next_bit() once.
 - for_each_set_bit_from() calls find_next_bit() twice.

On my AMD Ryzen 7 4700U, the results are:

Start testing for_each_bit()
for_each_set_bit:              15296 ns,   1000 iterations
for_each_set_bit_from:         15225 ns,   1000 iterations

Start testing for_each_bit() with cache flushing
for_each_set_bit:             547626 ns,   1000 iterations
for_each_set_bit_from:        497899 ns,   1000 iterations

for_each_set_bit_from() is ~10% faster than for_each_set_bit() with cold
caches, and there is no significant difference when flush_cache_all() is
not called.

So, it looks reasonable to switch for_each_set_bit() to use
find_next_bit() only.

Thanks,
Yury
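
P.S. For anyone who wants to check the per-iteration call pattern described
in this message without building a kernel, here is a minimal userspace
sketch. It is only a model under stated assumptions: find_first_bit() and
find_next_bit() are naive bit-by-bit stand-ins rather than the kernel
implementations, the call counters and the count_*() / check_call_pattern()
helpers are hypothetical names introduced for illustration, and the two
macros are written the way I understand the pre-patch definitions. It
counts calls only; it says nothing about timing or cache behaviour.

```c
#include <assert.h>
#include <limits.h>
#include <stdio.h>

#define BITS_PER_LONG (sizeof(unsigned long) * CHAR_BIT)

/* Hypothetical instrumentation: these counters do not exist in the kernel. */
static unsigned long first_calls, next_calls;

static int test_bit_naive(const unsigned long *addr, unsigned long nr)
{
	return (addr[nr / BITS_PER_LONG] >> (nr % BITS_PER_LONG)) & 1UL;
}

/* Naive stand-in for find_next_bit(): returns size if no bit is found. */
static unsigned long find_next_bit(const unsigned long *addr,
				   unsigned long size, unsigned long offset)
{
	next_calls++;
	for (; offset < size; offset++)
		if (test_bit_naive(addr, offset))
			return offset;
	return size;
}

/* Naive stand-in for a separately implemented find_first_bit(). */
static unsigned long find_first_bit(const unsigned long *addr,
				    unsigned long size)
{
	unsigned long i;

	first_calls++;
	for (i = 0; i < size; i++)
		if (test_bit_naive(addr, i))
			return i;
	return size;
}

/* Assumed shape of the macros before the patch under discussion. */
#define for_each_set_bit(bit, addr, size)				\
	for ((bit) = find_first_bit((addr), (size));			\
	     (bit) < (size);						\
	     (bit) = find_next_bit((addr), (size), (bit) + 1))

#define for_each_set_bit_from(bit, addr, size)				\
	for ((bit) = find_next_bit((addr), (size), (bit));		\
	     (bit) < (size);						\
	     (bit) = find_next_bit((addr), (size), (bit) + 1))

struct call_counts {
	unsigned long first;	/* calls to find_first_bit() */
	unsigned long next;	/* calls to find_next_bit() */
	unsigned long visited;	/* loop body executions */
};

/* Same bitmap as in the benchmark: only bit 0 is set. */
static struct call_counts count_for_each_set_bit(void)
{
	unsigned long bm[2] = { 1, 0 };
	unsigned long i, visited = 0;
	struct call_counts c;

	first_calls = next_calls = 0;
	for_each_set_bit(i, bm, 2 * BITS_PER_LONG)
		visited++;

	c.first = first_calls;
	c.next = next_calls;
	c.visited = visited;
	return c;
}

static struct call_counts count_for_each_set_bit_from(void)
{
	unsigned long bm[2] = { 1, 0 };
	unsigned long i = 0, visited = 0;
	struct call_counts c;

	first_calls = next_calls = 0;
	for_each_set_bit_from(i, bm, 2 * BITS_PER_LONG)
		visited++;

	c.first = first_calls;
	c.next = next_calls;
	c.visited = visited;
	return c;
}

/* Prints the counts and checks that both loops visit exactly one bit. */
static void check_call_pattern(void)
{
	struct call_counts a = count_for_each_set_bit();
	struct call_counts b = count_for_each_set_bit_from();

	assert(a.visited == 1 && b.visited == 1);
	printf("for_each_set_bit:      first=%lu next=%lu\n", a.first, a.next);
	printf("for_each_set_bit_from: first=%lu next=%lu\n", b.first, b.next);
}
```

Calling check_call_pattern() from a main() prints the number of
find_first_bit() and find_next_bit() calls for one pass of each macro over
the benchmark's bitmap.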