Re: GCC asm block optimizations on x86_64

Rask Ingemann Lambertsen wrote:
On Tue, Aug 28, 2007 at 11:02:49PM +0100, Darryl Miles wrote:
   Peephole definitions check for cases like this and won't do the
optimization clobbering the flags register if the flags register is live at
that point.

So I take it that the peephole pass works by knowing the instructions emitted, with annotations about the lifetimes of registers / flags / other useful state to help it. I was thinking it was rather more blind to such things than that.

If that is the case, then it leads me to believe there should be plenty of options open for achieving the goal of setting %edx to 0.


 0000000000000090 <u64_divide>:
   00:   49 89 d1                mov    %rdx,%r9      <<- [1] save %rdx in %r9 for arg-as-return
   03:   48 8b 07                mov    (%rdi),%rax
   06:   ?? ?? ??                xor    %edx,%edx     <<- implicit zero of high 32bits, would accept xorq %rdx,%rdx

   Right, that's why I suggest using "gcc -S -dp" because then it clearly
shows if it's a 32-bit (*movsi_xxx) or a 64-bit (*movdi_xxx) instruction (as
seen from GCC's point of view, since the actual CPU instruction is the same
in this and several other cases).

The code you are quoting is not generated by GCC but is my ideal expectation of what GCC could emit; see the original u64_divide.c source comment for what GCC actually emits. At the end of this email is the "-O6 -S -dp" output from GCC 4.0.2.

I did not understand the relevance of knowing whether it is (*movsi_xxx) or (*movdi_xxx). From my point of view, knowing that would not alter the two original points I was making, [1] and [3]. Maybe there is some pipelining (or other complex) issue I don't know about which makes the emitted code better than what I'm suggesting.

Interestingly, if I change the order of my input parameters I get different code, but the same two suboptimal situations exist.



   0b:   48 f7 36                divq   (%rsi)
   0e:   73 02                   jae    12 <u64_divide+0x12>
   10:   ?? ??                   inc    %r8d

   Can't you substitute the "jae; inc %r8d" sequence with "adcl $0, %r8d"?

That's a possibility, but the u64_divide function is not actually functional (it can't deal with 64-bit divisors; but that's beside the point I was highlighting). Another reason why it's not functional is that the processor flags on i386 are undefined after a DIV instruction anyway.

The carry-check code was actually hijacked from my uint32_nowrap_add(u_int32_t *dest, u_int32_t addvalue) function, which does want to know about the carry for overflow purposes, and your suggestion looks good.
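
For concreteness, here is a minimal sketch of that idea (the name and the constraint choices are my assumptions, not the original code) -- the "adcl $0" folds the carry flag straight into a C-visible value:

#include <stdint.h>

/* Hypothetical sketch, not the original: add 'addvalue' to '*dest' and
 * return 1 if the addition carried, 0 otherwise. */
static inline unsigned int add_with_carry_out(uint32_t *dest, uint32_t addvalue)
{
    unsigned int carried = 0;

    __asm__ ("addl %[val], %[mem]\n\t"
             "adcl $0, %[carry]"            /* fold CF into 'carried' */
             : [mem] "+m" (*dest), [carry] "+r" (carried)
             : [val] "ir" (addvalue)
             : "cc");

    return carried;
}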


   You can use "rm" for such a constraint.

Tested and working.  Thanks.
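
For anyone following along, a minimal sketch of what an "rm" operand buys you (this is not the real u64_divide source; names and constraints are my own for illustration): GCC is free to satisfy the operand with either a register or a memory reference, so the divisor can be used straight from memory when that is cheaper:

/* Hypothetical sketch: %rdx:%rax / divisor -> %rax (quotient), %rdx (remainder). */
static inline unsigned long udiv64(unsigned long dividend, unsigned long divisor,
                                   unsigned long *remainder)
{
    unsigned long quot = dividend;
    unsigned long rem  = 0;              /* high half of the dividend is zero */

    __asm__ ("divq %[div]"
             : "+a" (quot), "+d" (rem)
             : [div] "rm" (divisor)      /* register *or* memory, GCC's choice */
             : "cc");

    *remainder = rem;
    return quot;
}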



Another concern that occurs to me: if the __asm__ constraints are not 100% perfect, is there any way to test/permute every possible way the compiler might generate the code?

   I suppose you could write a script which outputs "calls" to the asm
construct with a constant, local variable (which we assume will end up in a
register) or global variable for each operand in turn, then try compiling
and assembling (i.e. -c) the resulting code.

My thinking was for GCC to facilitate some sort of automated testing, which would then help everyone on every platform, especially if I were then to try to create inlineable versions of functions using __asm__.

I would imagine that with the generated-symbol approach you could easily build a DLL with many versions in it, load it into a test-harness program, look up the symbols, execute every permutation of the function and verify the results. Couple this with, say, valgrind, and it may even be possible to verify exactly what memory is read from and written to.
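
To make the idea concrete, here is a rough sketch of the sort of permutation such a script might emit (everything below is hypothetical): one caller per operand kind -- constant, local variable (likely a register) and global variable (memory) -- all of which only need to compile and assemble cleanly with "gcc -c":

#include <stdint.h>

/* Hypothetical asm wrapper under test; "rm" lets GCC pick register or memory. */
static inline uint64_t asm_copy(uint64_t src)
{
    uint64_t dst;
    __asm__ ("movq %1, %0" : "=r" (dst) : "rm" (src));
    return dst;
}

static uint64_t g_src;                                    /* global operand  */

uint64_t use_const(void)  { return asm_copy(123); }       /* constant        */
uint64_t use_local(void)  { uint64_t v = g_src + 1;       /* local variable  */
                            return asm_copy(v); }
uint64_t use_global(void) { return asm_copy(g_src); }     /* global variable */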

All this would add confidence and eliminate a whole lot of possible uncertainty.



        xorl    %r8d, %r8d      # 44    *movdi_xor_rex64        [length = 3]
        movq    %rdx, %r9       # 8     *movdi_1_rex64/2        [length = 6]
        pushq   %rbx    # 38    *pushdi2_rex64/1        [length = 1]
.LCFI0:
        movq    (%rdi), %rax    # 16    *movdi_1_rex64/2        [length = 6]
        movl    %r8d, %edx      # 37    *movsi_1/1      [length = 3]
#APP

        xorl %ebx,%ebx
        divq (%rsi)
        jnc 1f
        incl %ebx
1:
        movq %rax,(%r9)
        movq %rdx,(%rcx)

#NO_APP
        movl    %ebx, %eax      # 36    *movsi_1/1      [length = 2]
        popq    %rbx    # 41    popdi1  [length = 1]
        ret     # 42    return_internal [length = 1]


Recapping on the original issues:

[1] failure to treat setting a register to the value zero as a special case (since there may be many ways to achieve this on a given CPU, and the different methods have different trade-offs: insn length, unwanted side effects), which might allow this operation a lot of freedom for moving / scheduling.

[3] usage of %ebx when %r8d would have been a better choice: by the time %ebx needed to be allocated, the lifetime of the temporary use of %r8d was already over. I.e. allocation of registers which form outputs but not inputs should take place as late as possible (at the moment of #APP); maybe by doing this %r8d would have been a candidate, which would negate the need for the push/pop of %rbx.


Thanks for your thoughts.  Maybe I am just expecting too much.

Darryl
