NACK. I'm sorry, but it seems you'll have to send a v6. See comments inline.

On Tue, Nov 28, 2017 at 02:13:34PM +0100, Clement Courbet wrote:
> We've measured that we spend ~0.6% of sys cpu time in cpumask_next_and().
> It's essentially a joined iteration in search for a non-zero bit, which
> is currently implemented as a lookup join (find a nonzero bit on the
> lhs, lookup the rhs to see if it's set there).
>
> Implement a direct join (find a nonzero bit on the incrementally built
> join).
>
> For cpumask_next_and, direct benchmarking shows that it's 1.17x to 14x
> faster with a geometric mean of 2.1 on 32 CPUs [1]. No impact on memory
> usage.
>
> Also added generic bitmap benchmarks in the new `test_find_bit` module
> for the reference and new implementations (results: see
> `find_next_and_bit` and `find_next_and_bit_ref` in [2] and [3] below).
> Note that on Arm (), the new c implementation still outperforms the
> old one that uses c+ the asm implementation of `find_next_bit` [3].

What is 'c+'? Is it a typo?

If you find the generic find_bit() on arm faster than the asm one, we
should definitely drop that piece of asm. I have this check on my long
list.

> [1] Approximate benchmark code:
>
> ```
> unsigned long src1p[nr_cpumask_longs] = {pattern1};
> unsigned long src2p[nr_cpumask_longs] = {pattern2};
> for (/*a bunch of repetitions*/) {
>   for (int n = -1; n <= nr_cpu_ids; ++n) {
>     asm volatile("" : "+rm"(src1p)); // prevent any optimization
>     asm volatile("" : "+rm"(src2p));
>     unsigned long result = cpumask_next_and(n, src1p, src2p);
>     asm volatile("" : "+rm"(result));
>   }
> }
> ```
>
> Results:
> pattern1    pattern2    time_before/time_after
> 0x0000ffff  0x0000ffff  1.65
> 0x0000ffff  0x00005555  2.24
> 0x0000ffff  0x00001111  2.94
> 0x0000ffff  0x00000000  14.0
> 0x00005555  0x0000ffff  1.67
> 0x00005555  0x00005555  1.71
> 0x00005555  0x00001111  1.90
> 0x00005555  0x00000000  6.58
> 0x00001111  0x0000ffff  1.46
> 0x00001111  0x00005555  1.49
> 0x00001111  0x00001111  1.45
> 0x00001111  0x00000000  3.10
> 0x00000000  0x0000ffff  1.18
> 0x00000000  0x00005555  1.18
> 0x00000000  0x00001111  1.17
> 0x00000000  0x00000000  1.25
> -----------------------------
> geo.mean                2.06
>
> [2] test_find_next_bit, X86 (skylake)
>
> [ 3913.477422] Start testing find_bit() with random-filled bitmap
> [ 3913.477847] find_next_bit: 160868 cycles, 16484 iterations
> [ 3913.477933] find_next_zero_bit: 169542 cycles, 16285 iterations
> [ 3913.478036] find_last_bit: 201638 cycles, 16483 iterations
> [ 3913.480214] find_first_bit: 4353244 cycles, 16484 iterations
> [ 3913.480216] Start testing find_next_and_bit() with random-filled bitmap
> [ 3913.481027] find_next_and_bit_ref: 319444 cycles, 8216 iterations
> [ 3913.481074] find_next_and_bit: 89604 cycles, 8216 iterations
> [ 3913.481075] Start testing find_bit() with sparse bitmap
> [ 3913.481078] find_next_bit: 2536 cycles, 66 iterations
> [ 3913.481252] find_next_zero_bit: 344404 cycles, 32703 iterations
> [ 3913.481255] find_last_bit: 2006 cycles, 66 iterations
> [ 3913.481265] find_first_bit: 17488 cycles, 66 iterations
> [ 3913.481266] Start testing find_next_and_bit() with sparse bitmap
> [ 3913.481270] find_next_and_bit_ref: 2486 cycles, 1 iterations
> [ 3913.481272] find_next_and_bit: 764 cycles, 1 iterations
>
> [3] test_find_next_bit, arm (v7 odroid XU3).
>
> [  267.206928] Start testing find_bit() with random-filled bitmap
> [  267.214752] find_next_bit: 4474 cycles, 16419 iterations
> [  267.221850] find_next_zero_bit: 5976 cycles, 16350 iterations
> [  267.229294] find_last_bit: 4209 cycles, 16419 iterations
> [  267.279131] find_first_bit: 1032991 cycles, 16420 iterations
> [  267.286265] Start testing find_next_and_bit() with random-filled bitmap
> [  267.294895] find_next_and_bit_ref: 7572 cycles, 8140 iterations
> [  267.302386] find_next_and_bit: 2290 cycles, 8140 iterations
> [  267.309422] Start testing find_bit() with sparse bitmap
> [  267.316054] find_next_bit: 191 cycles, 66 iterations
> [  267.322726] find_next_zero_bit: 8758 cycles, 32703 iterations
> [  267.329803] find_last_bit: 84 cycles, 66 iterations
> [  267.336169] find_first_bit: 4118 cycles, 66 iterations
> [  267.342627] Start testing find_next_and_bit() with sparse bitmap
> [  267.349992] find_next_and_bit_ref: 193 cycles, 1 iterations
> [  267.356919] find_next_and_bit: 91 cycles, 1 iterations

This is the old version of the test, based on get_cycles(). The new one
is based on ktime_get() and has other minor changes. I think you should
rerun the tests so as not to confuse readers. The new version is already
in linux-next.
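(For reference, the timing pattern of the reworked module looks roughly
like the sketch below. This is an approximation of it, not the exact
linux-next code.)

```
/*
 * Rough sketch of a ktime_get()-based measurement loop, approximating
 * the reworked test module. Not the exact linux-next code.
 */
#include <linux/bitops.h>
#include <linux/init.h>
#include <linux/ktime.h>
#include <linux/printk.h>

static int __init test_find_next_and_bit(const unsigned long *bitmap,
		const unsigned long *bitmap2, unsigned long len)
{
	unsigned long i, cnt;
	ktime_t time;

	time = ktime_get();
	for (cnt = i = 0; i < len; cnt++)
		i = find_next_and_bit(bitmap, bitmap2, len, i + 1);
	time = ktime_get() - time;
	pr_err("find_next_and_bit: %18llu ns, %6ld iterations\n",
	       (unsigned long long)time, cnt);

	return 0;
}
```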
> Signed-off-by: Clement Courbet <courbet@xxxxxxxxxx>
> ---
> Changes in v2:
>  - Refactored _find_next_common_bit into _find_next_bit, as suggested
>    by Yury Norov. This has no adverse effects on the performance side,
>    as the compiler successfully inlines the code.
>
> Changes in v3:
>  - Fixes find_next_and_bit() declaration.
>  - Synchronize _find_next_bit_le() with _find_next_bit()
>  - Synchronize the code in tools/lib/find_bit.c
>  - Add find_next_and_bit to guard code
>  - Fix invert value (bad sync with our internal tree on which I'm doing
>    the testing).
>
> Changes in v4:
>  - Mark _find_next_bit() inline.
> Changes in v5:
>  - Added benchmarks to test_find_bit.cc
>  - Fixed arm compilation: added missing header to arm bitops.h
>
>  arch/arm/include/asm/bitops.h           |  1 +
>  arch/unicore32/include/asm/bitops.h     |  2 ++
>  include/asm-generic/bitops/find.h       | 20 +++++++++++
>  include/linux/bitmap.h                  |  6 +++-
>  lib/cpumask.c                           |  9 ++---
>  lib/find_bit.c                          | 59 ++++++++++++++++++++++++---------
>  lib/test_find_bit.c                     | 47 +++++++++++++++++++++++++-
>  tools/include/asm-generic/bitops/find.h | 16 +++++++++
>  tools/lib/find_bit.c                    | 40 ++++++++++++++++------
>  9 files changed, 168 insertions(+), 32 deletions(-)
>
> diff --git a/arch/arm/include/asm/bitops.h b/arch/arm/include/asm/bitops.h
> index ce5ee762ed66..4cab9bb823fb 100644
> --- a/arch/arm/include/asm/bitops.h
> +++ b/arch/arm/include/asm/bitops.h
> @@ -338,6 +338,7 @@ static inline int find_next_bit_le(const void *p, int size, int offset)
>
>  #endif
>
> +#include <asm-generic/bitops/find.h>
>  #include <asm-generic/bitops/le.h>
>
>  /*
> diff --git a/arch/unicore32/include/asm/bitops.h b/arch/unicore32/include/asm/bitops.h
> index 401f597bc38c..c0cbdbe17168 100644
> --- a/arch/unicore32/include/asm/bitops.h
> +++ b/arch/unicore32/include/asm/bitops.h
> @@ -44,4 +44,6 @@ static inline int fls(int x)
>  #define find_first_bit find_first_bit
>  #define find_first_zero_bit find_first_zero_bit
>
> +#include <asm-generic/bitops/find.h>
> +
>  #endif /* __UNICORE_BITOPS_H__ */
> diff --git a/include/asm-generic/bitops/find.h b/include/asm-generic/bitops/find.h
> index 1ba611e16fa0..8a1ee10014de 100644
> --- a/include/asm-generic/bitops/find.h
> +++ b/include/asm-generic/bitops/find.h
> @@ -16,6 +16,22 @@ extern unsigned long find_next_bit(const unsigned long *addr, unsigned long
>  		size, unsigned long offset);
>  #endif
>
> +#ifndef find_next_and_bit
> +/**
> + * find_next_and_bit - find the next set bit in both memory regions
> + * @addr1: The first address to base the search on
> + * @addr2: The second address to base the search on
> + * @offset: The bitnumber to start searching at
> + * @size: The bitmap size in bits
> + *
> + * Returns the bit number for the next set bit
> + * If no bits are set, returns @size.
> + */
> +extern unsigned long find_next_and_bit(const unsigned long *addr1,
> +		const unsigned long *addr2, unsigned long size,
> +		unsigned long offset);
> +#endif
> +
>  #ifndef find_next_zero_bit
>  /**
>   * find_next_zero_bit - find the next cleared bit in a memory region
> @@ -55,8 +71,12 @@ extern unsigned long find_first_zero_bit(const unsigned long *addr,
>  					 unsigned long size);
>  #else /* CONFIG_GENERIC_FIND_FIRST_BIT */
>
> +#ifndef find_first_bit
>  #define find_first_bit(addr, size) find_next_bit((addr), (size), 0)
> +#endif
> +#ifndef find_first_zero_bit
>  #define find_first_zero_bit(addr, size) find_next_zero_bit((addr), (size), 0)
> +#endif

How is this change related to find_next_and_bit()? Does the unguarded
version break something? Please elaborate.
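(As a usage illustration of the new interface declared above - a
hypothetical caller, not part of the patch:)

```
/* Hypothetical caller of the new API, for illustration only. */
#include <linux/bitmap.h>
#include <linux/bitops.h>
#include <linux/printk.h>

#define MY_NBITS 128	/* made-up size for the example */

static void walk_common_bits(const unsigned long *a, const unsigned long *b)
{
	unsigned long bit;

	/* Visit each position that is set in both a and b, in one pass. */
	for (bit = find_next_and_bit(a, b, MY_NBITS, 0);
	     bit < MY_NBITS;
	     bit = find_next_and_bit(a, b, MY_NBITS, bit + 1))
		pr_info("common bit: %lu\n", bit);
}
```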
>  #endif /* CONFIG_GENERIC_FIND_FIRST_BIT */
>
> diff --git a/include/linux/bitmap.h b/include/linux/bitmap.h
> index 3489253e38fc..45a9e169d0fd 100644
> --- a/include/linux/bitmap.h
> +++ b/include/linux/bitmap.h
> @@ -83,8 +83,12 @@
>   * test_and_change_bit(bit, addr)	Change bit and return old value
>   * find_first_zero_bit(addr, nbits)	Position first zero bit in *addr
>   * find_first_bit(addr, nbits)		Position first set bit in *addr
> - * find_next_zero_bit(addr, nbits, bit)	Position next zero bit in *addr >= bit
> + * find_next_zero_bit(addr, nbits, bit)
> + *					Position next zero bit in *addr >= bit
>   * find_next_bit(addr, nbits, bit)	Position next set bit in *addr >= bit
> + * find_next_and_bit(addr1, addr2, nbits, bit)
> + *					Same as find_first_bit, but in
> + *					(*addr1 & *addr2)

Nit: addr1, addr2 don't sound self-explanatory. What about
find_next_and_bit(addr1, and_map, nbits, bit), or something like that?

>   *
>   */
>
> diff --git a/lib/cpumask.c b/lib/cpumask.c
> index 35fe142ebb5e..beca6244671a 100644
> --- a/lib/cpumask.c
> +++ b/lib/cpumask.c
> @@ -33,10 +33,11 @@ EXPORT_SYMBOL(cpumask_next);
>  int cpumask_next_and(int n, const struct cpumask *src1p,
>  		     const struct cpumask *src2p)
>  {
> -	while ((n = cpumask_next(n, src1p)) < nr_cpu_ids)
> -		if (cpumask_test_cpu(n, src2p))
> -			break;
> -	return n;
> +	/* -1 is a legal arg here. */
> +	if (n != -1)
> +		cpumask_check(n);
> +	return find_next_and_bit(cpumask_bits(src1p), cpumask_bits(src2p),
> +		nr_cpumask_bits, n + 1);
>  }
>  EXPORT_SYMBOL(cpumask_next_and);
>
> diff --git a/lib/find_bit.c b/lib/find_bit.c
> index 6ed74f78380c..ee3df93ba69a 100644
> --- a/lib/find_bit.c
> +++ b/lib/find_bit.c
> @@ -21,22 +21,29 @@
>  #include <linux/export.h>
>  #include <linux/kernel.h>
>
> -#if !defined(find_next_bit) || !defined(find_next_zero_bit)
> +#if !defined(find_next_bit) || !defined(find_next_zero_bit) || \
> +		!defined(find_next_and_bit)
>
>  /*
> - * This is a common helper function for find_next_bit and
> - * find_next_zero_bit. The difference is the "invert" argument, which
> - * is XORed with each fetched word before searching it for one bits.
> + * This is a common helper function for find_next_bit, find_next_zero_bit, and
> + * find_next_and_bit. The differences are:
> + *  - The "invert" argument, which is XORed with each fetched word before
> + *    searching it for one bits.
> + *  - The optional "addr2", which is anded with "addr1" if present.

"The optional "and", which is ANDed with "addr" if present." - sounds
better, doesn't it?
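(Aside: the helper being modified here is the "direct join" from the
commit message. To make the contrast with the old "lookup join" concrete,
here is a hypothetical userspace sketch of both strategies; all names are
made up, and nbits is assumed to be a multiple of BITS_PER_LONG to keep
it short.)

```
/* Hypothetical sketch of the two join strategies. Illustration only. */
#include <stdio.h>

#define BITS_PER_LONG (8 * sizeof(unsigned long))

/* find_next_bit()-alike: first set bit of map at or after 'start'. */
static unsigned long next_bit(const unsigned long *map, unsigned long nbits,
			      unsigned long start)
{
	unsigned long i, word;

	for (i = start / BITS_PER_LONG; i * BITS_PER_LONG < nbits; i++) {
		word = map[i];
		if (i == start / BITS_PER_LONG)
			word &= ~0UL << (start % BITS_PER_LONG);
		if (word)
			return i * BITS_PER_LONG + __builtin_ctzl(word);
	}
	return nbits;
}

static int bit_is_set(const unsigned long *map, unsigned long bit)
{
	return (map[bit / BITS_PER_LONG] >> (bit % BITS_PER_LONG)) & 1;
}

/* Lookup join (old scheme): walk set bits of lhs, probe rhs per hit. */
static unsigned long lookup_join(const unsigned long *lhs,
		const unsigned long *rhs, unsigned long nbits,
		unsigned long start)
{
	while ((start = next_bit(lhs, nbits, start)) < nbits) {
		if (bit_is_set(rhs, start))
			break;
		start++;
	}
	return start;
}

/* Direct join (new scheme): AND each word pair, search the joined word. */
static unsigned long direct_join(const unsigned long *lhs,
		const unsigned long *rhs, unsigned long nbits,
		unsigned long start)
{
	unsigned long i, word;

	for (i = start / BITS_PER_LONG; i * BITS_PER_LONG < nbits; i++) {
		word = lhs[i] & rhs[i];	/* the incrementally built join */
		if (i == start / BITS_PER_LONG)
			word &= ~0UL << (start % BITS_PER_LONG);
		if (word)
			return i * BITS_PER_LONG + __builtin_ctzl(word);
	}
	return nbits;
}

int main(void)
{
	unsigned long a[] = { 0x0f0fUL }, b[] = { 0xff00UL };

	/* Both schemes must agree: the first bit set in a & b is bit 8. */
	printf("%lu %lu\n", lookup_join(a, b, BITS_PER_LONG, 0),
	       direct_join(a, b, BITS_PER_LONG, 0));
	return 0;
}
```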
>   */
> -static unsigned long _find_next_bit(const unsigned long *addr,
> -		unsigned long nbits, unsigned long start, unsigned long invert)
> +static inline unsigned long _find_next_bit(const unsigned long *addr1,
> +		const unsigned long *addr2, unsigned long nbits,
> +		unsigned long start, unsigned long invert)
>  {
>  	unsigned long tmp;
>
>  	if (unlikely(start >= nbits))
>  		return nbits;
>
> -	tmp = addr[start / BITS_PER_LONG] ^ invert;
> +	tmp = addr1[start / BITS_PER_LONG];
> +	if (addr2)
> +		tmp &= addr2[start / BITS_PER_LONG];
> +	tmp ^= invert;
>
>  	/* Handle 1st word. */
>  	tmp &= BITMAP_FIRST_WORD_MASK(start);
> @@ -47,7 +54,10 @@ static unsigned long _find_next_bit(const unsigned long *addr,
>  		if (start >= nbits)
>  			return nbits;
>
> -		tmp = addr[start / BITS_PER_LONG] ^ invert;
> +		tmp = addr1[start / BITS_PER_LONG];
> +		if (addr2)
> +			tmp &= addr2[start / BITS_PER_LONG];
> +		tmp ^= invert;
>  	}
>
>  	return min(start + __ffs(tmp), nbits);
> @@ -61,7 +71,7 @@ static unsigned long _find_next_bit(const unsigned long *addr,
>  unsigned long find_next_bit(const unsigned long *addr, unsigned long size,
>  			    unsigned long offset)
>  {
> -	return _find_next_bit(addr, size, offset, 0UL);
> +	return _find_next_bit(addr, NULL, size, offset, 0UL);
>  }
>  EXPORT_SYMBOL(find_next_bit);
>  #endif
> @@ -70,11 +80,21 @@ EXPORT_SYMBOL(find_next_bit);
>  unsigned long find_next_zero_bit(const unsigned long *addr, unsigned long size,
>  				 unsigned long offset)
>  {
> -	return _find_next_bit(addr, size, offset, ~0UL);
> +	return _find_next_bit(addr, NULL, size, offset, ~0UL);
>  }
>  EXPORT_SYMBOL(find_next_zero_bit);
>  #endif
>
> +#if !defined(find_next_and_bit)
> +unsigned long find_next_and_bit(const unsigned long *addr1,
> +		const unsigned long *addr2, unsigned long size,
> +		unsigned long offset)
> +{
> +	return _find_next_bit(addr1, addr2, size, offset, 0UL);
> +}
> +EXPORT_SYMBOL(find_next_and_bit);
> +#endif
> +
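(The 0UL / ~0UL arguments in the wrappers above are the whole "invert"
trick: XORing with ~0UL turns a search for set bits into a search for
clear bits, so one helper serves both. A hypothetical standalone sketch:)

```
/* Hypothetical standalone illustration of the "invert" trick. */
#include <assert.h>

int main(void)
{
	unsigned long word = 0xf0f0f0f0UL;

	/* find_next_bit-style: invert = 0 leaves the word unchanged. */
	unsigned long find_one = word ^ 0UL;
	/* find_next_zero_bit-style: invert = ~0UL flips every bit, so the
	 * clear bits of 'word' become the set bits of the searched value. */
	unsigned long find_zero = word ^ ~0UL;

	assert(__builtin_ctzl(find_one) == 4);	/* first set bit of word */
	assert(__builtin_ctzl(find_zero) == 0);	/* first clear bit of word */
	return 0;
}
```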
>  #ifndef find_first_bit
>  /*
>   * Find the first set bit in a memory region.
> @@ -146,15 +166,19 @@ static inline unsigned long ext2_swab(const unsigned long y)
>  }
>
>  #if !defined(find_next_bit_le) || !defined(find_next_zero_bit_le)
> -static unsigned long _find_next_bit_le(const unsigned long *addr,
> -		unsigned long nbits, unsigned long start, unsigned long invert)
> +static inline unsigned long _find_next_bit_le(const unsigned long *addr1,
> +		const unsigned long *addr2, unsigned long nbits,
> +		unsigned long start, unsigned long invert)
>  {
>  	unsigned long tmp;
>
>  	if (unlikely(start >= nbits))
>  		return nbits;
>
> -	tmp = addr[start / BITS_PER_LONG] ^ invert;
> +	tmp = addr1[start / BITS_PER_LONG];
> +	if (addr2)
> +		tmp &= addr2[start / BITS_PER_LONG];
> +	tmp ^= invert;
>
>  	/* Handle 1st word. */
>  	tmp &= ext2_swab(BITMAP_FIRST_WORD_MASK(start));
> @@ -165,7 +189,10 @@ static unsigned long _find_next_bit_le(const unsigned long *addr,
>  		if (start >= nbits)
>  			return nbits;
>
> -		tmp = addr[start / BITS_PER_LONG] ^ invert;
> +		tmp = addr1[start / BITS_PER_LONG];
> +		if (addr2)
> +			tmp &= addr2[start / BITS_PER_LONG];
> +		tmp ^= invert;
>  	}
>
>  	return min(start + __ffs(ext2_swab(tmp)), nbits);
> @@ -176,7 +203,7 @@ static unsigned long _find_next_bit_le(const unsigned long *addr,
>  unsigned long find_next_zero_bit_le(const void *addr, unsigned
>  		long size, unsigned long offset)
>  {
> -	return _find_next_bit_le(addr, size, offset, ~0UL);
> +	return _find_next_bit_le(addr, NULL, size, offset, ~0UL);
>  }
>  EXPORT_SYMBOL(find_next_zero_bit_le);
>  #endif
> @@ -185,7 +212,7 @@ EXPORT_SYMBOL(find_next_zero_bit_le);
>  unsigned long find_next_bit_le(const void *addr, unsigned
>  		long size, unsigned long offset)
>  {
> -	return _find_next_bit_le(addr, size, offset, 0UL);
> +	return _find_next_bit_le(addr, NULL, size, offset, 0UL);
>  }
>  EXPORT_SYMBOL(find_next_bit_le);
>  #endif
> diff --git a/lib/test_find_bit.c b/lib/test_find_bit.c
> index f4394a36f9aa..90773efa4694 100644
> --- a/lib/test_find_bit.c
> +++ b/lib/test_find_bit.c
> @@ -35,6 +35,7 @@
>  #define SPARSE 500
>
>  static DECLARE_BITMAP(bitmap, BITMAP_LEN) __initdata;
> +static DECLARE_BITMAP(bitmap2, BITMAP_LEN) __initdata;
>
>  /*
>   * This is Schlemiel the Painter's algorithm. It should be called after
> @@ -107,6 +108,42 @@ static int __init test_find_last_bit(const void *bitmap, unsigned long len)
>  	return 0;
>  }
>
> +static int __init test_find_next_and_bit(const void *bitmap,
> +		const void *bitmap2, unsigned long len)
> +{
> +	unsigned long i, cnt;
> +	cycles_t cycles;
> +
> +	cycles = get_cycles();

Please switch to ktime_get(), as the other tests do.

> +	for (cnt = i = 0; i < BITMAP_LEN; cnt++)
> +		i = find_next_and_bit(bitmap, bitmap2, BITMAP_LEN, i+1);
> +	cycles = get_cycles() - cycles;
> +	pr_err("find_next_and_bit: %ld cycles, %ld iterations\n", (long)cycles,
> +		cnt);

Use pretty formatting here to align with the other tests' output.

> +
> +	return 0;
> +}
> +
> +static int __init test_find_next_and_bit_ref(const void *bitmap,
> +		const void *bitmap2, unsigned long len)
> +{
> +	unsigned long i, cnt;
> +	cycles_t cycles;
> +
> +	cycles = get_cycles();
> +	for (cnt = i = 0; i < BITMAP_LEN; cnt++)
> +		while ((i = find_next_bit(bitmap, BITMAP_LEN, i + 1))
> +				< BITMAP_LEN)

Nit:
	while (i < BITMAP_LEN) {
		i = find_next_bit(bitmap, BITMAP_LEN, i + 1);

> +			if (test_bit(i, bitmap2))
> +				break;

	}

> +
> +	cycles = get_cycles() - cycles;
> +	pr_err("find_next_and_bit_ref: %ld cycles, %ld iterations\n",
> +		(long)cycles, cnt);
> +
> +	return 0;
> +}

I don't understand the purpose of this. It's obvious that
test_find_next_and_bit() cannot be slower than test_find_next_and_bit_ref().
We don't have 'reference' implementations for the other functions here.

If you still think you need it, please introduce find_next_and_bit_ref()
to make the purpose of the test clear. (I spent more than a minute
understanding what happens here, and my first guess was wrong.)
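Something along these lines (a hypothetical sketch of the suggested
helper, not existing code) would name the slow reference path explicitly:

```
/*
 * Hypothetical sketch of the suggested helper: a deliberately naive
 * "lookup join", kept only as a benchmark baseline.
 */
static unsigned long __init find_next_and_bit_ref(const unsigned long *addr1,
		const unsigned long *addr2, unsigned long size,
		unsigned long offset)
{
	while ((offset = find_next_bit(addr1, size, offset)) < size) {
		if (test_bit(offset, addr2))
			break;
		offset++;
	}

	return offset;
}
```

The test function would then just time a plain loop over
find_next_and_bit_ref(), mirroring test_find_next_and_bit().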
>  static int __init find_bit_test(void)
>  {
>  	unsigned long nbits = BITMAP_LEN / SPARSE;
> @@ -114,23 +151,31 @@
>  	pr_err("\nStart testing find_bit() with random-filled bitmap\n");
>
>  	get_random_bytes(bitmap, sizeof(bitmap));
> +	get_random_bytes(bitmap2, sizeof(bitmap2));
>
>  	test_find_next_bit(bitmap, BITMAP_LEN);
>  	test_find_next_zero_bit(bitmap, BITMAP_LEN);
>  	test_find_last_bit(bitmap, BITMAP_LEN);
>  	test_find_first_bit(bitmap, BITMAP_LEN);
> +	test_find_next_and_bit_ref(bitmap, bitmap2, BITMAP_LEN);
> +	test_find_next_and_bit(bitmap, bitmap2, BITMAP_LEN);
>
>  	pr_err("\nStart testing find_bit() with sparse bitmap\n");
>
>  	bitmap_zero(bitmap, BITMAP_LEN);
> +	bitmap_zero(bitmap2, BITMAP_LEN);
>
> -	while (nbits--)
> +	while (nbits--) {
>  		__set_bit(prandom_u32() % BITMAP_LEN, bitmap);
> +		__set_bit(prandom_u32() % BITMAP_LEN, bitmap2);
> +	}
>
>  	test_find_next_bit(bitmap, BITMAP_LEN);
>  	test_find_next_zero_bit(bitmap, BITMAP_LEN);
>  	test_find_last_bit(bitmap, BITMAP_LEN);
>  	test_find_first_bit(bitmap, BITMAP_LEN);
> +	test_find_next_and_bit_ref(bitmap, bitmap2, BITMAP_LEN);
> +	test_find_next_and_bit(bitmap, bitmap2, BITMAP_LEN);

For sparse bitmaps this will be like traversing zero bitmaps. I doubt
these numbers will be representative. Do we need this test at all?

>  	return 0;
>  }

Yury