Re: GCC asm block optimizations on x86_64

Darryl Miles <darryl-mailinglists@xxxxxxxxxxxx> · Tue, 28 Aug 2007 23:02:49 +0100

Rask Ingemann Lambertsen wrote:
> On Mon, Aug 27, 2007 at 06:11:04AM +0100, Darryl L. Miles wrote:
>> [1] This issue is in the way %edx is zero'ed, I would think zeroing out
>> registers/memory/whatever would be a special optimization case in this
>> code its clear that there is no useful value in the CPU condition flags,
>> so "xorl %edx,%edx" would make most sense, instead of having to find
>> another register to load with zero before then copying.  Interestingly
>> enough -O generates a "mov $0,%r8d", while -O2 generates a "xor %r8d,%r8d".
>
>    Peephole optimization isn't performed at -O.
>
>    It is usually better to post asm output from "gcc -S -dp" than "objdump
> --disassemble" output because the former shows which instruction pattern GCC
> is using.

Thanks for the note on the peephole.  Can the peephole substitute
sequences when the lifetimes of various processor features overlap?
Take the 'flags' bits, for example: you can't peephole a sequence that
does a compare (setting flag bits), then loads a register with zero
(not affecting flag bits), then branches on the flag bits.  Replacing
the zero load with 'xor' on i386 would destroy the flags.
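
To make the flags point concrete, here is a minimal sketch (not taken
from the code under discussion).  The function names are made up for
illustration.  Each one sets the carry flag with "stc", zeroes a
register, then reads the carry flag back with "setc".  The -1 fallback
is only there so the sketch still builds on non-x86 targets.

```c
#include <assert.h>

/* Hypothetical helper: carry flag as seen after zeroing with mov. */
int carry_after_mov_zero(void)
{
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
    unsigned int zeroed;
    unsigned char carry;
    __asm__ volatile ("stc\n\t"          /* set CF, standing in for a compare */
                      "movl $0, %0\n\t"  /* mov leaves the flags untouched */
                      "setc %1"          /* read CF back */
                      : "=r" (zeroed), "=q" (carry)
                      :
                      : "cc");
    assert(zeroed == 0);
    return carry;
#else
    return -1;                           /* not x86: nothing to measure */
#endif
}

/* Hypothetical helper: carry flag as seen after zeroing with xor. */
int carry_after_xor_zero(void)
{
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
    unsigned int zeroed;
    unsigned char carry;
    __asm__ volatile ("stc\n\t"          /* set CF, standing in for a compare */
                      "xorl %0, %0\n\t"  /* xor clears CF as a side effect */
                      "setc %1"          /* read CF back */
                      : "=r" (zeroed), "=q" (carry)
                      :
                      : "cc");
    assert(zeroed == 0);
    return carry;
#else
    return -1;                           /* not x86: nothing to measure */
#endif
}
```

On x86 the mov variant should still see the carry flag set, while the
xor variant should see it cleared, which is why the substitution is
only safe when the flags are dead at that point.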

 0000000000000090 <u64_divide>:
   00:   49 89 d1                mov    %rdx,%r9     <<- [1] save %rdx in %r9 for arg-as-return
   03:   48 8b 07                mov    (%rdi),%rax
   06:   ?? ?? ??                xor    %edx,%edx    <<- implicit zero of high 32bits, would accept xorq %rdx,%rdx
   09:   ?? ??                   xor    %r8d,%r8d
   0b:   48 f7 36                divq   (%rsi)
   0e:   73 02                   jae    12 <u64_divide+0x12>
   10:   ?? ??                   inc    %r8d
   12:   49 89 01                mov    %rax,(%r9)   <<- [1] use saved %rdx to return argument
   15:   48 89 11                mov    %rdx,(%rcx)
   18:   ?? ??                   mov    %r8d,%eax
   1a:   c3                      retq

Oops, there were actually a few errors in my hand-optimized version, so
the version above is fixed.  The function's return value is 32 bits
wide, so %r8d is the correct register to select.  The insn at offset
0x18 should not have referenced %ebx but %r8/%r8d.  Also, the insn at
offset 0x06 is probably only 2 bytes long.

I also did not say which version of GCC I was using.  It was 4.0.2, but
I've just tried 4.2.1 and the same code is generated.  However, -O6
appears to inline things further, which led me to find an invalid
constraint: "g" ((*divisor)) should be "r" ((*divisor)).  With "g", GCC
tried to use a constant, which divq does not accept, although a
register or memory via an indirected register is valid here.
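
For reference, a minimal sketch of how the divisor operand could be
constrained.  The original source of u64_divide is not shown in this
message, so the name u64_divide_sketch and its signature are
assumptions inferred from the listing above, and the sketch always
returns 0 rather than reproducing the jae/inc logic.  It uses "rm"
(register or memory) instead of "g"; "r" alone, as suggested above,
would also assemble.  The plain C branch is only there so the sketch
builds on non-x86_64 targets.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical reconstruction, not the original function. */
int u64_divide_sketch(const uint64_t *dividend, const uint64_t *divisor,
                      uint64_t *quotient, uint64_t *remainder)
{
    uint64_t q, r;

    assert(*divisor != 0);   /* divq raises a divide error on zero */

#if defined(__x86_64__) && defined(__GNUC__)
    /* divq divides %rdx:%rax by its operand: quotient in %rax,
       remainder in %rdx.  "rm" lets GCC pick a register or a memory
       operand but never an immediate, which divq cannot encode. */
    __asm__ ("divq %[d]"
             : "=a" (q), "=d" (r)
             : "0" (*dividend), "1" ((uint64_t)0), [d] "rm" (*divisor)
             : "cc");
#else
    q = *dividend / *divisor;
    r = *dividend % *divisor;
#endif

    *quotient = q;
    *remainder = r;
    return 0;
}
```

With "g" in place of "rm", a constant divisor exposed by inlining can be
substituted as an immediate, which is the failure described above.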

Another concern that occurs to me: if the __asm__ constraints are not
100% perfect, is there any way to test/permute every possible way the
compiler might generate the code?

The main thing is that if I have given a register, memory or constant
constraint, I'd like to know whether all 3 versions would assemble.
The number of possible permutations would multiply up, but at least I
could know for sure that the constraints are correct.

This would need GCC to run in a special mode.  Maybe I could give it
the name of the symbol/function I wanted it to work on, and the
generated code would contain multiple instances of that symbol with a
counter appended to the symbol name.

gcc -c -o /tmp/testit.o -fasm-block-permutate=u64_divide 
-fasm-block-depth=all testit.c

Where "-fasm-block-permutate=u64_divide" earmarks which code wants 
special treatment.

Where "-fasm-block-depth=all" is some way of describing how deep you
want the permutations to go.  Possibly to the point of mathematical
certainty.

Then in the generated /tmp/testit.o I would get symbols:

u64_divide  <-- this would be the default code gen
u64_divide_0000001  <-- this would be code gen for auto generated case 1
u64_divide_0000002  <-- this would be code gen for auto generated case 2

The output would then be annotated, as with "-S" or "-S -dp", to
explain the criteria for each auto-generated case.
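
Lacking such a mode, the check can be roughly approximated by hand.
The sketch below is only an illustration: the macro DEFINE_DIV_VARIANT
and the TRY_IMMEDIATE switch are made-up names, and the symbol names
simply borrow the numbering scheme above.  It forces one constraint
letter per function so that each operand form gets assembled
separately.  The plain C branch is only there so the sketch builds on
non-x86_64 targets.

```c
#include <assert.h>
#include <stdint.h>

#if defined(__x86_64__) && defined(__GNUC__)
/* Emit one function per constraint so each operand form is assembled. */
#define DEFINE_DIV_VARIANT(name, constraint, operand)                 \
    uint64_t name(uint64_t dividend, uint64_t divisor)                \
    {                                                                 \
        uint64_t quotient, remainder;                                 \
        assert(divisor != 0); /* divq faults on a zero divisor */     \
        __asm__ ("divq %[d]"                                          \
                 : "=a" (quotient), "=d" (remainder)                  \
                 : "0" (dividend), "1" ((uint64_t)0),                 \
                   [d] constraint (operand)                           \
                 : "cc");                                             \
        (void)remainder;                                              \
        return quotient;                                              \
    }
#else
/* Portable stand-in: no constraints to exercise on other targets. */
#define DEFINE_DIV_VARIANT(name, constraint, operand)                 \
    uint64_t name(uint64_t dividend, uint64_t divisor)                \
    {                                                                 \
        assert(divisor != 0);                                         \
        return dividend / divisor;                                    \
    }
#endif

DEFINE_DIV_VARIANT(u64_divide_0000001, "r", divisor)  /* register form */
DEFINE_DIV_VARIANT(u64_divide_0000002, "m", divisor)  /* memory form */

#ifdef TRY_IMMEDIATE
/* Constant form: expected to be rejected by the assembler on x86_64,
   since divq has no immediate encoding. */
DEFINE_DIV_VARIANT(u64_divide_0000003, "i", 3)
#endif
```

Building normally exercises the register and memory forms; building
with -DTRY_IMMEDIATE should fail, which is how a too-permissive
constraint such as "g" would show up.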

Just a thought,

Darryl