Re: GCC asm block optimizations on x86_64

Rask Ingemann Lambertsen wrote:
On Tue, Aug 28, 2007 at 11:02:49PM +0100, Darryl Miles wrote:
   Peephole definitions check for cases like this and won't do the
optimization clobbering the flags register if the flags register is live at
that point.

So I take it that the peephole pass works by knowing the instructions emitted, with annotations about the lifetimes of registers / flags / other useful state to help it. I was thinking it was rather more blind to such things than that.

If that is the case, then it leads me to believe there should be plenty of options open for achieving the goal of setting %edx to 0.


 0000000000000090 <u64_divide>:
   00:   49 89 d1                mov    %rdx,%r9      <<- [1] save %rdx in %r9 for arg-as-return
   03:   48 8b 07                mov    (%rdi),%rax
   06:   ?? ?? ??                xor    %edx,%edx     <<- implicit zero of high 32bits, would accept xorq %rdx,%rdx

   Right, that's why I suggest using "gcc -S -dp" because then it clearly
shows if it's a 32-bit (*movsi_xxx) or a 64-bit (*movdi_xxx) instruction (as
seen from GCC's point of view, since the actual CPU instruction is the same
in this and several other cases).

The code you are quoting is not generated by GCC but is my ideal expectation of what GCC could emit; see the original u64_divide.c source comment for what GCC actually emits. At the end of this email is the "-O6 -S -dp" output from GCC 4.0.2.

I did not understand the relevance of knowing whether it is (*movsi_xxx) or (*movdi_xxx). From my point of view, knowing that would not alter the two original points I was making, [1] and [3]. Maybe there is some pipelining (or other complex) issue I don't know about which makes the emitted code better than what I'm suggesting.

Interestingly, if I change the order of my input parameters I get different code, but the same two suboptimal situations exist.



   0b:   48 f7 36                divq   (%rsi)
   0e:   73 02                   jae    12 <u64_divide+0x12>
   10:   ?? ??                   inc    %r8d

   Can't you substitute the "jae; inc %r8d" sequence with "adcl $0, %r8d"?

That's a possibility, but the u64_divide function is not actually functional (it can't deal with 64-bit divisors; but that's beside the point I was highlighting). Another reason why it's not functional is that the processor flags on i386 are undefined after a DIV instruction anyway.

The carry-check code was actually hijacked from my uint32_nowrap_add(u_int32_t *dest, u_int32_t addvalue) function, which does want to know about the carry for overflow purposes, and your suggestion looks good.
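
For concreteness, here is a minimal sketch of that idea (the name and the constraint choices are my assumptions, not the original code) -- the "adcl $0" folds the carry flag straight into a C-visible value:

#include <stdint.h>

/* Hypothetical sketch, not the original: add 'addvalue' to '*dest' and
 * return 1 if the addition carried, 0 otherwise. */
static inline unsigned int add_with_carry_out(uint32_t *dest, uint32_t addvalue)
{
    unsigned int carried = 0;

    __asm__ ("addl %[val], %[mem]\n\t"
             "adcl $0, %[carry]"            /* fold CF into 'carried' */
             : [mem] "+m" (*dest), [carry] "+r" (carried)
             : [val] "ir" (addvalue)
             : "cc");

    return carried;
}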


   You can use "rm" for such a constraint.

Tested and working.  Thanks.
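
For anyone following along, a minimal sketch of what an "rm" operand buys you (this is not the real u64_divide source; names and constraints are my own for illustration): GCC is free to satisfy the operand with either a register or a memory reference, so the divisor can be used straight from memory when that is cheaper:

/* Hypothetical sketch: %rdx:%rax / divisor -> %rax (quotient), %rdx (remainder). */
static inline unsigned long udiv64(unsigned long dividend, unsigned long divisor,
                                   unsigned long *remainder)
{
    unsigned long quot = dividend;
    unsigned long rem  = 0;              /* high half of the dividend is zero */

    __asm__ ("divq %[div]"
             : "+a" (quot), "+d" (rem)
             : [div] "rm" (divisor)      /* register *or* memory, GCC's choice */
             : "cc");

    *remainder = rem;
    return quot;
}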



Another concern that occurs to me: if the __asm__ constraints are not 100% perfect, is there any way to test/permute every possible way the compiler might generate the code?

   I suppose you could write a script which outputs "calls" to the asm
construct with a constant, local variable (which we assume will end up in a
register) or global variable for each operand in turn, then try compiling
and assembling (i.e. -c) the resulting code.

My thinking was for GCC to facilitate some sort of automated testing, which would then help everyone on every platform, especially if I were then to try to create inlineable versions of functions using __asm__.

I would imagine that with the generated-symbol approach you could easily build a DLL with many versions in it, load it into a test-harness program, look up the symbols, execute every permutation of the function and verify the results. Couple this with, say, valgrind, and it may even be possible to verify exactly what memory is read from and written to.
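
To make the idea concrete, here is a rough sketch of the sort of permutation such a script might emit (everything below is hypothetical): one caller per operand kind -- constant, local variable (likely a register) and global variable (memory) -- all of which only need to compile and assemble cleanly with "gcc -c":

#include <stdint.h>

/* Hypothetical asm wrapper under test; "rm" lets GCC pick register or memory. */
static inline uint64_t asm_copy(uint64_t src)
{
    uint64_t dst;
    __asm__ ("movq %1, %0" : "=r" (dst) : "rm" (src));
    return dst;
}

static uint64_t g_src;                                    /* global operand  */

uint64_t use_const(void)  { return asm_copy(123); }       /* constant        */
uint64_t use_local(void)  { uint64_t v = g_src + 1;       /* local variable  */
                            return asm_copy(v); }
uint64_t use_global(void) { return asm_copy(g_src); }     /* global variable */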

All this would add confidence and eliminate a whole lot of possible uncertainty.



        xorl    %r8d, %r8d      # 44    *movdi_xor_rex64        [length = 3]
        movq    %rdx, %r9       # 8     *movdi_1_rex64/2        [length = 6]
        pushq   %rbx    # 38    *pushdi2_rex64/1        [length = 1]
.LCFI0:
        movq    (%rdi), %rax    # 16    *movdi_1_rex64/2        [length = 6]
        movl    %r8d, %edx      # 37    *movsi_1/1      [length = 3]
#APP

        xorl %ebx,%ebx
        divq (%rsi)
        jnc 1f
        incl %ebx
1:
        movq %rax,(%r9)
        movq %rdx,(%rcx)

#NO_APP
        movl    %ebx, %eax      # 36    *movsi_1/1      [length = 2]
        popq    %rbx    # 41    popdi1  [length = 1]
        ret     # 42    return_internal [length = 1]


Recapping on the original issues:

[1] failure to treat setting a register to the value zero as a special case (since there may be many ways to achieve this on a given CPU, and the different methods have different trade-offs: insn length, unwanted side effects), which might allow this operation a lot of freedom for moving / scheduling.

[3] usage of %ebx when %r8d would have been a better choice: by the time %ebx needed to be allocated, the lifetime of the temporary use of %r8d was already over. I.e. allocation of registers which form outputs but not inputs should take place as late as possible (at the moment of #APP); maybe by doing this %r8d would have been a candidate, which would negate the need for the push/pop of %rbx.


Thanks for your thoughts.  Maybe I am just expecting too much.

Darryl
