Re: GCC asm block optimizations on x86_64


 



Rask Ingemann Lambertsen wrote:
On Mon, Aug 27, 2007 at 06:11:04AM +0100, Darryl L. Miles wrote:
[1] This issue is in the way %edx is zeroed. I would think zeroing out registers/memory/whatever would be a special optimization case. In this code it's clear that there is no useful value in the CPU condition flags, so "xorl %edx,%edx" would make the most sense, instead of having to find another register to load with zero and then copying it. Interestingly enough, -O generates a "mov $0,%r8d", while -O2 generates a "xor %r8d,%r8d".

   Peephole optimization isn't performed at -O.

   It is usually better to post asm output from "gcc -S -dp" than "objdump
--disassemble" output because the former shows which instruction pattern GCC
is using.



Thanks for the note on the peephole. Can the peephole pass substitute sequences when the lifetimes of various processor features overlap? Take the 'flags' bits, for example: you can't peephole a sequence that does a compare (setting flag bits), then loads a register with zero (not affecting flag bits), then branches on the flag bits. On i386, replacing the load of zero with 'xor' would destroy the flags.
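To make the flags hazard concrete, here is a minimal sketch in GCC extended asm. The function names are hypothetical and the code is not taken from the thread. It assumes x86_64 and a GNU-compatible compiler, with a plain C fallback elsewhere. The first variant zeroes the result after the compare, so it has to use mov. The second shows the only place an xor could go, which is before the compare.

```c
#include <stdint.h>

/* Hypothetical helper: returns 1 if a < b (unsigned).
   The result register is zeroed AFTER the compare, so the zeroing
   instruction must leave the flags alone: mov does, xor would not. */
static int below_zero_after_cmp(uint64_t a, uint64_t b)
{
#if defined(__GNUC__) && defined(__x86_64__)
    int r;
    __asm__ ("cmpq %2, %1\n\t"     /* sets CF if a < b */
             "movl $0, %0\n\t"     /* flags are preserved */
             "setb %b0"            /* low byte = CF */
             : "=&r"(r)
             : "r"(a), "r"(b)
             : "cc");
    return r;
#else
    return a < b;                  /* portable fallback */
#endif
}

/* Same result, but the zeroing is hoisted ABOVE the compare,
   where clobbering the flags is harmless, so xor is safe. */
static int below_zero_before_cmp(uint64_t a, uint64_t b)
{
#if defined(__GNUC__) && defined(__x86_64__)
    int r;
    __asm__ ("xorl %0, %0\n\t"     /* clobbers flags, but nothing depends on them yet */
             "cmpq %2, %1\n\t"
             "setb %b0"
             : "=&r"(r)            /* early clobber: r is written before inputs are read */
             : "r"(a), "r"(b)
             : "cc");
    return r;
#else
    return a < b;                  /* portable fallback */
#endif
}
```

Swapping the movl in the first function for an xorl at the same position would make setb read the flags left by the xor rather than by the compare.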



 0000000000000090 <u64_divide>:
   00:   49 89 d1                mov    %rdx,%r9	<<- [1] save %rdx in %r9 for arg-as-return
   03:   48 8b 07                mov    (%rdi),%rax
   06:   ?? ?? ??                xor    %edx,%edx	<<- implicit zero of high 32bits, would accept xorq %rdx,%rdx
   09:   ?? ??                   xor    %r8d,%r8d
   0b:   48 f7 36                divq   (%rsi)
   0e:   73 02                   jae    12 <u64_divide+0x12>
   10:   ?? ??                   inc    %r8d
   12:   49 89 01                mov    %rax,(%r9)	<<- [1] use saved %rdx to return argument
   15:   48 89 11                mov    %rdx,(%rcx)
   18:   ?? ??                   mov    %r8d,%eax
   1a:   c3                      retq

Oops, there were actually a few errors in the hand-optimized version, so the above version is fixed. The function's return value is 32 bits wide, so %r8d is the correct register to select. The insn at offset 0x18 should not have referenced %ebx but %r8/%r8d. Also, the insn at offset 0x06 is probably only 2 bytes long.


I also did not say which version of GCC I was using: it was 4.0.2, but I've just tried 4.2.1 and the same code is generated. However, -O6 appears to inline things further, which led me to find an invalid constraint: "g" ((*divisor)) should be "r" ((*divisor)). With "g", GCC tried to use a constant, whereas only a register or memory via an indirected register is valid here.
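As an illustration of the constraint problem, here is a minimal sketch of a divq wrapper. The source of u64_divide is not shown in this message, so the function name and signature below are hypothetical. It uses "rm", on the assumption that both a register and a memory operand are acceptable to divq. Plain "r" would also be valid but forces a register. "g" additionally permits an immediate, which divq cannot encode.

```c
#include <stdint.h>

/* Hypothetical wrapper: divides a 64-bit dividend by a 64-bit divisor,
   returns the quotient and stores the remainder through *remainder.
   The caller must ensure divisor != 0 (divq traps otherwise). */
static inline uint64_t div_u64_sketch(uint64_t dividend, uint64_t divisor,
                                      uint64_t *remainder)
{
#if defined(__GNUC__) && defined(__x86_64__)
    uint64_t quot, rem;
    __asm__ ("divq %[d]"
             : "=a"(quot), "=d"(rem)      /* divq writes %rax and %rdx */
             : "a"(dividend),             /* low half of the dividend in %rax */
               "d"((uint64_t)0),          /* high half in %rdx, zeroed */
               [d] "rm"(divisor)          /* register or memory, never an immediate */
             : "cc");
    *remainder = rem;
    return quot;
#else
    *remainder = dividend % divisor;      /* portable fallback */
    return dividend / divisor;
#endif
}
```

With "rm", a call such as div_u64_sketch(x, 10, &r) that gets inlined still assembles, because the compiler has to materialize the constant in a register or memory slot first.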


Another concern that occurs to me: if the __asm__ constraints are not 100% perfect, is there any way to test/permute every possible way the compiler might generate the code?

The main thing is that if I have given a register, memory, or constant constraint, I'd like to know whether all 3 versions would assemble. The number of possible permutations would multiply up, but at least I would know for sure that the constraints are correct.
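Without any compiler support, one manual approximation is to write one variant per operand kind and pin each to a single constraint, so that every alternative is forced through the assembler at least once. The sketch below does this for an addq, which accepts all three kinds. The function names are hypothetical, and it assumes x86_64 with a GNU-compatible compiler, falling back to plain C elsewhere.

```c
#include <stdint.h>

/* Each variant forces one operand kind for the source of addq. */

static uint64_t add_src_reg(uint64_t acc, uint64_t x)
{
#if defined(__GNUC__) && defined(__x86_64__)
    __asm__ ("addq %1, %0" : "+r"(acc) : "r"(x) : "cc");   /* register only */
    return acc;
#else
    return acc + x;
#endif
}

static uint64_t add_src_mem(uint64_t acc, uint64_t x)
{
#if defined(__GNUC__) && defined(__x86_64__)
    __asm__ ("addq %1, %0" : "+r"(acc) : "m"(x) : "cc");   /* memory only */
    return acc;
#else
    return acc + x;
#endif
}

static uint64_t add_src_imm5(uint64_t acc)
{
#if defined(__GNUC__) && defined(__x86_64__)
    __asm__ ("addq %1, %0" : "+r"(acc) : "i"(5) : "cc");   /* immediate only */
    return acc;
#else
    return acc + 5;
#endif
}
```

If any one of the three fails to assemble, a combined constraint covering that operand kind is too permissive for the instruction. This only scales to a handful of operands, which is where the automated permutation mode described next would help.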

This would need GCC to run in a special mode. Maybe I could give the name of the symbol/function I wanted it to work on, and the generated code would emit multiple instances of that symbol with a counter appended to the symbol name.


gcc -c -o /tmp/testit.o -fasm-block-permutate=u64_divide -fasm-block-depth=all testit.c

Where "-fasm-block-permutate=u64_divide" earmarks which code wants special treatment.

Where "-fasm-block-depth=all" is some way of describing how deep you want the permutations to go, possibly to the point of mathematical certainty.


Then in the generated /tmp/testit.o I would get symbols:

u64_divide  <-- this would be the default code gen
u64_divide_0000001  <-- this would be code gen for auto generated case 1
u64_divide_0000002  <-- this would be code gen for auto generated case 2


Then there would be annotated output, like with "-S" or "-S -dp", explaining what the criteria for each auto-generated case are.


Just a thought,

Darryl
