On Mon, May 15, 2017 at 10:58 PM, Yury Norov <ynorov@xxxxxxxxxxxxxxxxxx> wrote:
> On Mon, May 15, 2017 at 10:31:17PM +0200, Arnd Bergmann wrote:
>> On Mon, May 15, 2017 at 6:17 PM, Yury Norov <ynorov@xxxxxxxxxxxxxxxxxx> wrote:
>> >
> Yes, something like this. But size is not a multiple of BITS_PER_LONG in
> general. This should work better:
>
>	switch (round_up(size, BITS_PER_LONG)) {
>	case BITS_PER_LONG * 4:
>		if (addr[0])
>			goto ret;
>		addr++;
>		idx += BITS_PER_LONG;
>	case BITS_PER_LONG * 3:
>		if (addr[0])
>			goto ret;
>		addr++;
>		idx += BITS_PER_LONG;
>	case BITS_PER_LONG * 2:
>		if (addr[0])
>			goto ret;
>		addr++;
>		idx += BITS_PER_LONG;
>	case BITS_PER_LONG * 1:
>		if (addr[0])
>			goto ret;
>		return size;
>	}
>
>	return __find_first_bit(addr, size);
>
> ret:
>	return min(idx + __ffs(addr[0]), size);
> }
>
> (I didn't test it yet though)
>
>> However, on architectures that rely on
>> include/asm-generic/bitops/__ffs.h or something
>> similarly verbose, this would just add needless bloat
>> to the size rather than actually making a difference
>> in performance.

I tried something along these lines earlier and couldn't get it to
produce comparable object code in the common case. For
sched_find_first_bit() I was able to cheat and pass 128 as the length
(along with a comment), and most others are either multiples of
BITS_PER_LONG, or they are not constant.

	Arnd
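[For reference, here is a self-contained userspace sketch of the unrolling idea from the quoted patch. It is not kernel code: `find_first_bit_small`, `my_ffs`, and `round_up_bits` are hypothetical stand-ins, with `__builtin_ctzl` emulating the kernel's `__ffs` and an open-coded loop standing in for `__find_first_bit`. In-kernel, `round_up(size, BITS_PER_LONG)` would fold at compile time for constant sizes, leaving only the relevant word checks.]

```c
#include <limits.h>

#define BITS_PER_LONG (CHAR_BIT * sizeof(long))

/* Stand-in for the kernel's __ffs(): index of the lowest set bit.
 * Undefined for w == 0, so callers must check for a nonzero word first. */
static unsigned long my_ffs(unsigned long w)
{
	return (unsigned long)__builtin_ctzl(w);
}

/* Stand-in for round_up(size, BITS_PER_LONG). */
static unsigned long round_up_bits(unsigned long n)
{
	return (n + BITS_PER_LONG - 1) / BITS_PER_LONG * BITS_PER_LONG;
}

/* Out-of-line fallback, standing in for __find_first_bit(). */
static unsigned long generic_find_first_bit(const unsigned long *addr,
					    unsigned long size)
{
	unsigned long idx, bit;

	for (idx = 0; idx * BITS_PER_LONG < size; idx++) {
		if (addr[idx]) {
			bit = idx * BITS_PER_LONG + my_ffs(addr[idx]);
			return bit < size ? bit : size;
		}
	}
	return size;	/* no bit set: convention is to return size */
}

/* The unrolled fast path for bitmaps of up to four words: one word
 * check per case, falling through to the next word. */
static unsigned long find_first_bit_small(const unsigned long *addr,
					  unsigned long size)
{
	unsigned long idx = 0, bit;

	switch (round_up_bits(size)) {
	case BITS_PER_LONG * 4:
		if (addr[0])
			goto ret;
		addr++;
		idx += BITS_PER_LONG;
		/* fall through */
	case BITS_PER_LONG * 3:
		if (addr[0])
			goto ret;
		addr++;
		idx += BITS_PER_LONG;
		/* fall through */
	case BITS_PER_LONG * 2:
		if (addr[0])
			goto ret;
		addr++;
		idx += BITS_PER_LONG;
		/* fall through */
	case BITS_PER_LONG * 1:
		if (addr[0])
			goto ret;
		return size;
	}
	return generic_find_first_bit(addr, size);
ret:
	/* Clamp: a set bit past a non-multiple size must not be reported. */
	bit = idx + my_ffs(addr[0]);
	return bit < size ? bit : size;
}
```

Note the clamp at `ret:`: for sizes that are not a multiple of BITS_PER_LONG, a stray bit set beyond `size` in the last word must still make the function return `size`, matching find_first_bit() semantics.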