Re: [PATCH] alpha: simplify and optimize sched_find_first_bit

Matt Turner <mattst88@xxxxxxxxx> · Wed, 28 Apr 2010 22:08:54 -0400

On Thu, Apr 8, 2010 at 3:15 PM, Richard Henderson <rth@xxxxxxxxxxx> wrote:
> On 04/08/2010 11:34 AM, mattst88@xxxxxxxxx wrote:
>> +     asm(
>> +             "cmoveq %0,64,%1        # ofs = (b[0] ? ofs : 64);\n"
>> +             "cmoveq %0,%2,%0        # temp = (b[0] ? b[0] : b[1]);\n"
>> +             "cttz   %0,%0           # output = cttz(temp);\n "
>> +             : "=r" (output), "=r" (ofs)
>> +             : "r" (b[1]), "0" (b[0]), "1" (0)
>
> I must say I'd also prefer a comment like
>
>        /* This is equivalent to
>                ofs = (b[0] ? 0 : 64);
>                tmp = (b[0] ? b[0] : b[1]);
>           but is a bit faster than what GCC would produce on its own.  */
>        asm("cmoveq %0,64,%1\n\tcmoveq %0,%2,%0"
>            : "=r"(output), "=r"(ofs)
>            : "r"(b[1]), "0"(b[0]), "1"(0));
>
> ... except that I can't see that it is, at least for mainline gcc.
>
> [anchor:~] cat z.c
> long foo(const unsigned long *b)
> {
>  unsigned long b0, b1, ofs, tmp;
>
>  b0 = b[0];
>  b1 = b[1];
>  ofs = (b0 ? 0 : 64);
>  tmp = (b0 ? b0 : b1);
>
>  /* tmp = __ffs(tmp); -- elided for clarity wrt ev5 vs ev67 */
>  return tmp + ofs;
> }
>
> -mcpu=ev5 -Os (to avoid nop padding):
>        ldq $2,0($16)
>        ldq $1,8($16)
>        lda $0,64($31)
>        cmovne $2,0,$0
>        cmovne $2,$2,$1
>        addq $0,$1,$0
>
> -mcpu=ev6 -Os:
>        ldq $0,0($16)
>        ldq $1,8($16)
>        cmovne $0,$0,$1
>        cmpeq $0,0,$0
>        sll $0,6,$0
>        addq $0,$1,$0
>
> I seem to recall that cmov is slightly more expensive on ev6,
> so gcc doesn't prefer it and came up with an equivalent using
> cmpeq+sll.
>
> If some previous version of gcc isn't so smart, I'm ok with
> continuing to use the asm.
>
>
> r~
>

So with your test program, the code generation results are:

4.3.4 -Os: good
4.3.4 -O1: bad
4.3.4 -O2: good
4.3.4 -O3: good

4.4.3 -Os: bad
4.4.3 -O1: bad
4.4.3 -O2: bad
4.4.3 -O3: good

4.5.0 -Os: good
4.5.0 -O1: bad
4.5.0 -O2: good
4.5.0 -O3: good

 o -O3 is produces good code in all versions.
 o -O1 is bad in all versions.
 o -Os and -O2 regressed from 4.3.4 to 4.4.3,
   but are back to 4.3.4 quality as of 4.5.0.

All produced cmov instructions just as you said.

My patch doesn't help any of the bad cases and even causes some that
were good to produce worse code, so it's not useful. Does any of this
look like it should warrant a gcc bug report, Richard?

I'll send a patch just to update sched_find_first_bit to search just
the first 100-bits.

Thanks!
Matt
Attachment:
test

Description: Binary data