On 1/14/20 1:49 PM, Linus Torvalds wrote:
> On Tue, Jan 14, 2020 at 1:37 PM Vineet Gupta <Vineet.Gupta1@xxxxxxxxxxxx> wrote:
>>
>> On 1/14/20 12:42 PM, Arnd Bergmann wrote:
>>>
>>> What's wrong with the generic version on little-endian? Any
>>> chance you can find a way to make it work as well for you as
>>> this copy?
>>
>> find_zero() by default doesn't use pop count instructions.
>
> Don't you think the generic find_zero() is likely just as fast as the
> pop count instruction? On 32-bit, I think it's like a shift and a mask
> and a couple of additions.

You are right that in the grand scheme of things it may be less than noise.

ARC pop count version

    # bits = (bits - 1) & ~bits;
    # return bits >> 7;
    sub r0,r6,1
    bic r6,r0,r6
    lsr r0,r6,7

    # return fls(mask) >> 3;
    fls.f  r0, r0
    add.nz r0, r0, 1
    asr r5,r0,3

    j_s.d [blink]

Generic version

    # bits = (bits - 1) & ~bits;
    # return bits >> 7;
    sub r5,r6,1
    bic r6,r5,r6
    lsr r5,r6,7

    # unsigned long a = (0x0ff0001+mask) >> 23;
    # return a & mask;
    add r0,r5,0x0ff0001    <-- this is an 8-byte instruction though
    lsr_s r0,r0,23
    and r5,r5,r0

    j_s.d [blink]

But it's the usual itch/inclination of arch people to try and use the specific
instruction if available.

>
> The 64-bit case has a multiply that is likely expensive unless you
> have a good multiplication unit (but what 64-bit architecture
> doesn't?), but the generic 32-bit LE code should already be pretty
> close to optimal, and it might not be worth it to worry about it.
>
> (The big-endian case is very different, and architectures really can
> do much better. But LE allows for bit tricks using the carry chain)

-Vineet
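For anyone who wants to check that the two find_zero() variants compared
above agree, here is a minimal user-space sketch for the 32-bit
little-endian case. It is not the kernel code: the names has_zero_sketch(),
fls_sketch(), find_zero_generic(), find_zero_fls() and first_zero_byte()
are made up for illustration, fls_sketch() is a plain loop standing in for
the ARC fls instruction, and the has_zero step (subtract 0x01010101, mask
with ~a and 0x80808080) is assumed to be the usual word-at-a-time zero-byte
test, which is not quoted in this mail.

```c
#include <assert.h>
#include <stdint.h>

#define ONE_BYTES 0x01010101u
#define HIGH_BITS 0x80808080u

/* Assumed zero-byte test: nonzero iff some byte of 'a' is zero.
 * Only the lowest set bit is reliable; it marks the first zero byte. */
static uint32_t has_zero_sketch(uint32_t a)
{
	return (a - ONE_BYTES) & ~a & HIGH_BITS;
}

/* bits = (bits - 1) & ~bits; return bits >> 7;
 * Yields 0, 0xff, 0xffff or 0xffffff for first zero byte 0..3. */
static uint32_t create_zero_mask(uint32_t bits)
{
	bits = (bits - 1) & ~bits;
	return bits >> 7;
}

/* Generic variant from the mail: one add, one shift, one and. */
static uint32_t find_zero_generic(uint32_t mask)
{
	uint32_t a = (0x0ff0001u + mask) >> 23;
	return a & mask;
}

/* Portable stand-in for fls(): 1-based index of the highest set bit,
 * 0 when x is 0. */
static unsigned int fls_sketch(uint32_t x)
{
	unsigned int n = 0;

	while (x) {
		n++;
		x >>= 1;
	}
	return n;
}

/* fls-based variant from the mail: return fls(mask) >> 3; */
static uint32_t find_zero_fls(uint32_t mask)
{
	return fls_sketch(mask) >> 3;
}

/* Index of the first zero byte of a little-endian word.
 * The caller must guarantee that the word contains a zero byte. */
static uint32_t first_zero_byte(uint32_t word)
{
	return find_zero_generic(create_zero_mask(has_zero_sketch(word)));
}

/* Both variants must agree on the four masks create_zero_mask() can yield. */
static void check_variants_agree(void)
{
	static const uint32_t masks[] = { 0x0u, 0xffu, 0xffffu, 0xffffffu };
	unsigned int i;

	for (i = 0; i < 4; i++)
		assert(find_zero_generic(masks[i]) == find_zero_fls(masks[i]));
}
```

Calling check_variants_agree() from a test main() exercises every mask value
that can reach find_zero() on 32-bit, so the choice between the two is about
instruction count and encoding size, not about results.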