On Mon, May 15, 2017 at 10:31:17PM +0200, Arnd Bergmann wrote:
> On Mon, May 15, 2017 at 6:17 PM, Yury Norov <ynorov@xxxxxxxxxxxxxxxxxx> wrote:
> > On Mon, May 15, 2017 at 06:06:18PM +0200, Arnd Bergmann wrote:
> >> On Mon, May 15, 2017 at 5:47 PM, Yury Norov <ynorov@xxxxxxxxxxxxxxxxxx> wrote:
> >> > On Sun, May 14, 2017 at 08:09:17PM +0200, Ingo Molnar wrote:
> >> >
> >> > I also think that sched_find_first_bit() may be faster than find_first_bit()
> >> > because it's inlined in the caller. We can do the same for find_first_bit()
> >> > when it is called with a small constant size, so that all parts of the kernel
> >> > use the fast find_first_bit(), not only sched.
> >>
> >> I suspect the first step would be to 'select GENERIC_FIND_FIRST_BIT'
> >> on ARM64, which should already improve the performance for those
> >> files that never call the 'next' variants.
> >>
> >> Adding an inline version of find_first_{,zero_}bit could also help, but
> >> is harder to quantify.
> >>
> >
> > I checked again, and in fact I measured on top of this patch:
> > https://lkml.org/lkml/2017/5/13/137
> > So find_first_bit is already enabled.
>
> Ok. I've played around with this for a bit more and came to a generic
> version that is almost as good as the current sched_find_first_bit()
> on x86 (one extra comparison):
>
> +#define sched_find_first_bit(b) find_first_bit(b, 128)
> -extern unsigned long find_first_bit(const unsigned long *addr,
> +extern unsigned long __find_first_bit(const unsigned long *addr,
>                                      unsigned long size);
>
> +static inline unsigned long find_first_bit(const unsigned long *addr,
> +                                           unsigned long size)
> +{
> +        unsigned long idx;
> +
> +        if (!__builtin_constant_p(size))
> +                return __find_first_bit(addr, size);
> +
> +        idx = 0;
> +        switch (size) {
> +        case BITS_PER_LONG * 4:
> +                if (addr[0])
> +                        return __ffs(addr[0]) + idx;
> +                addr++;
> +                idx += BITS_PER_LONG;
> +        case BITS_PER_LONG * 3:
> +                if (addr[0])
> +                        return __ffs(addr[0]) + idx;
> +                addr++;
> +                idx += BITS_PER_LONG;
> +        case BITS_PER_LONG * 2:
> +                if (addr[0])
> +                        return __ffs(addr[0]) + idx;
> +                addr++;
> +                idx += BITS_PER_LONG;
> +        case BITS_PER_LONG * 1:
> +                if (addr[0])
> +                        return __ffs(addr[0]) + idx;
> +                addr++;
> +                idx += BITS_PER_LONG;
> +                return idx;
> +        }
> +
> +        return __find_first_bit(addr, size);
> +}
>

Yes, something like this. But size is not a multiple of BITS_PER_LONG in
general. This should work better:

        switch (round_up(size, BITS_PER_LONG)) {
        case BITS_PER_LONG * 4:
                if (addr[0])
                        goto ret;
                addr++;
                idx += BITS_PER_LONG;
        case BITS_PER_LONG * 3:
                if (addr[0])
                        goto ret;
                addr++;
                idx += BITS_PER_LONG;
        case BITS_PER_LONG * 2:
                if (addr[0])
                        goto ret;
                addr++;
                idx += BITS_PER_LONG;
        case BITS_PER_LONG * 1:
                if (addr[0])
                        goto ret;
                addr++;
                idx += BITS_PER_LONG;
                return idx;
        }

        return __find_first_bit(addr, size);
ret:
        return idx + min(__ffs(addr[0]), size % BITS_PER_LONG);
}

(I didn't test it yet though)

> However, on architectures that rely on
> include/asm-generic/bitops/__ffs.h or something
> similarly verbose, this would just add needless bloat
> to the size rather than actually making a difference
> in performance.
>
> Arnd
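
For reference, here is a minimal self-contained sketch of how the two
snippets above could be combined (untested; it assumes the usual kernel
helpers __ffs(), min(), round_up() and BITS_PER_LONG, plus the
out-of-line __find_first_bit() from Arnd's diff). Non-constant or large
sizes are filtered out up front, so the switch only ever sees sizes that
round up to one of its cases, and the final result is clamped against
size so that a bit set above a partial last word is reported as not
found:

static inline unsigned long find_first_bit(const unsigned long *addr,
                                           unsigned long size)
{
        unsigned long idx = 0;

        /* Non-constant or large sizes go to the out-of-line helper. */
        if (!__builtin_constant_p(size) || size > BITS_PER_LONG * 4)
                return __find_first_bit(addr, size);

        /* Round up so that e.g. size == 100 enters the two-word case. */
        switch (round_up(size, BITS_PER_LONG)) {
        case BITS_PER_LONG * 4:
                if (addr[0])
                        goto found;
                addr++;
                idx += BITS_PER_LONG;
                /* fall through */
        case BITS_PER_LONG * 3:
                if (addr[0])
                        goto found;
                addr++;
                idx += BITS_PER_LONG;
                /* fall through */
        case BITS_PER_LONG * 2:
                if (addr[0])
                        goto found;
                addr++;
                idx += BITS_PER_LONG;
                /* fall through */
        case BITS_PER_LONG * 1:
                if (addr[0])
                        goto found;
                idx += BITS_PER_LONG;
        }

        /* No bit set in any of the words covered by size. */
        return min(idx, size);

found:
        /* A bit set above a partial last word counts as "not found". */
        return min(idx + __ffs(addr[0]), size);
}

With the #define from Arnd's diff, sched_find_first_bit(b) then expands
to find_first_bit(b, 128), i.e. a compile-time two-word scan; whether
this avoids the bloat Arnd mentions for asm-generic __ffs() users would
still need to be checked per architecture.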