Rask Ingemann Lambertsen wrote:
> On Mon, Aug 27, 2007 at 06:11:04AM +0100, Darryl L. Miles wrote:
>> [1] This issue is in the way %edx is zeroed. I would think zeroing out
>> registers/memory/whatever would be a special optimization case. In this
>> code it's clear that there is no useful value in the CPU condition flags,
>> so "xorl %edx,%edx" would make most sense, instead of having to find
>> another register to load with zero before then copying. Interestingly
>> enough, -O generates a "mov $0,%r8d", while -O2 generates a "xor %r8d,%r8d".
>
> Peephole optimization isn't performed at -O.
>
> It is usually better to post asm output from "gcc -S -dp" than "objdump
> --disassemble" output because the former shows which instruction pattern GCC
> is using.
Thanks for the note on the peephole. Can the peephole pass substitute
sequences when the lifetimes of various processor features overlap?
Take the 'flags' bits, for example: you can't peephole a sequence that
does a compare (setting flag bits), then loads a register with zero
(not affecting flag bits), then does a branch based on the flag bits.
Replacing the load of zero with 'xor' on i386 would destroy the flags.
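The hazard described above can be reproduced by hand. What follows is a minimal sketch, assuming an x86-64 target and GCC-style extended asm; the names below_mov and below_xor are made up for illustration. Both functions put a zeroing instruction between a compare and a conditional branch, and only the mov form leaves the carry flag from the compare intact.

```c
/* Sketch only: assumes x86-64 and GCC/Clang extended asm.
   below_mov and below_xor are hypothetical names. */
#include <stdint.h>

/* Returns 1 if a < b (unsigned), else 0. */
static int below_mov(uint64_t a, uint64_t b)
{
    int r;
    __asm__ ("cmp %2, %1\n\t"   /* computes a - b; CF is set when a < b */
             "mov $0, %0\n\t"   /* mov does not modify the flags */
             "jae 1f\n\t"       /* taken when CF is clear (a >= b) */
             "inc %0\n"
             "1:"
             : "=&r" (r)
             : "r" (a), "r" (b)
             : "cc");
    return r;
}

/* Same sequence with the zeroing done by xor. xor clears CF, so the
   jae is always taken and the result is 0 whatever the inputs are. */
static int below_xor(uint64_t a, uint64_t b)
{
    int r;
    __asm__ ("cmp %2, %1\n\t"   /* sets CF when a < b ... */
             "xor %0, %0\n\t"   /* ... but xor overwrites the flags */
             "jae 1f\n\t"
             "inc %0\n"
             "1:"
             : "=&r" (r)
             : "r" (a), "r" (b)
             : "cc");
    return r;
}
```

With these definitions, below_mov(1, 2) yields 1 while below_xor(1, 2) yields 0, because the xor wiped out the carry that the branch depended on.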
0000000000000090 <u64_divide>:
  00: 49 89 d1    mov %rdx,%r9      <<- [1] save %rdx in %r9 for arg-as-return
  03: 48 8b 07    mov (%rdi),%rax
  06: ?? ?? ??    xor %edx,%edx     <<- implicit zero of high 32 bits, would accept xorq %rdx,%rdx
  09: ?? ??       xor %r8d,%r8d
  0b: 48 f7 36    divq (%rsi)
  0e: 73 02       jae 12 <u64_divide+0x12>
  10: ?? ??       inc %r8d
  12: 49 89 01    mov %rax,(%r9)    <<- [1] use saved %rdx to return argument
  15: 48 89 11    mov %rdx,(%rcx)
  18: ?? ??       mov %r8d,%eax
  1a: c3          retq
Oops, there were actually a few errors in the hand-optimized version, so
the above version is fixed. The function's return value is 32 bits wide,
so %r8d is the correct register to select. The insn at offset 0x18 should
not have referenced %ebx but %r8/%r8d. Also, the insn at offset 0x06 is
probably only 2 bytes long.
I also did not say which version of GCC I was using. It was 4.0.2, but
I've just tried 4.2.1 and the same code is generated. However, -O6
appears to inline things further, which led me to find an invalid
constraint: "g" ((*divisor)) should be "r" ((*divisor)). With "g" the
compiler tried to use a constant, whereas only a register or memory via
an indirected register is valid here.
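A minimal sketch of the constraint issue, assuming x86-64 and GCC-style extended asm. div_u64 is a hypothetical stand-in and not the u64_divide from this thread, whose source is not shown here. The divisor uses "r", so the compiler has to supply a register even when inlining exposes a constant; with "g" it would be free to emit an immediate, and divq has no immediate form.

```c
/* Sketch only: assumes x86-64 and GCC/Clang extended asm.
   div_u64 is a hypothetical helper, not the original u64_divide. */
#include <stdint.h>

/* Divides dividend by divisor, stores the remainder, returns the quotient. */
static uint64_t div_u64(uint64_t dividend, uint64_t divisor,
                        uint64_t *remainder)
{
    uint64_t quot, rem;
    __asm__ ("divq %[div]"              /* divides %rdx:%rax by the operand */
             : "=a" (quot), "=d" (rem)  /* quotient in %rax, remainder in %rdx */
             : "a" (dividend),
               "d" ((uint64_t)0),       /* high 64 bits of the dividend */
               [div] "r" (divisor)      /* "g" here would also allow an immediate */
             : "cc");
    *remainder = rem;
    return quot;
}
```

For instance, div_u64(100, 7, &rem) returns 14 and leaves 2 in rem. Writing the constraint as "rm" instead should admit the memory form as well as the register form while still excluding immediates.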
Another concern that occurs to me: if the __asm__ constraints are not
100% perfect, is there any way to test/permute every possible way the
compiler might generate the code?
The main thing is that if I have given a register, memory or constant
constraint, I'd like to know whether all 3 versions would assemble.
The number of possible permutations would multiply up, but at least I
could know for sure that the constraints are correct.
This would need GCC to run in a special mode. Maybe I could give the
name of the symbol/function I want it to work on, and the generated
code would emit multiple instances of that symbol with a counter
appended to the symbol name.
  gcc -c -o /tmp/testit.o -fasm-block-permutate=u64_divide \
      -fasm-block-depth=all testit.c
Here "-fasm-block-permutate=u64_divide" earmarks which code wants
special treatment, and "-fasm-block-depth=all" is some way of describing
how deep you want the permutations to go, possibly to the point of
mathematical certainty.
Then in the generated /tmp/testit.o I would get symbols:
u64_divide <-- this would be the default code gen
u64_divide_0000001 <-- this would be code gen for auto generated case 1
u64_divide_0000002 <-- this would be code gen for auto generated case 2
Annotated output, as with "-S" or "-S -dp", would then explain what the
criteria for each auto-generated case are.
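Without such a mode, part of the idea can be approximated by hand. The sketch below assumes x86-64 and GCC-style extended asm, and uses a hypothetical macro DEFINE_DIV_CASE to stamp out one copy of a small divide helper per constraint, with a counter in the name much like the symbols listed above. Each copy pins the divisor to a single operand kind, so a constraint that cannot assemble shows up as a build error for that copy alone.

```c
/* Sketch only: assumes x86-64 and GCC/Clang extended asm.
   DEFINE_DIV_CASE and the div_case_* names are hypothetical. */
#include <stdint.h>

#define DEFINE_DIV_CASE(name, constraint)                         \
    static uint64_t name(uint64_t dividend, uint64_t divisor,     \
                         uint64_t *remainder)                     \
    {                                                             \
        uint64_t quot, rem;                                       \
        __asm__ ("divq %[div]"                                    \
                 : "=a" (quot), "=d" (rem)                        \
                 : "a" (dividend), "d" ((uint64_t)0),             \
                   [div] constraint (divisor)                     \
                 : "cc");                                         \
        *remainder = rem;                                         \
        return quot;                                              \
    }

DEFINE_DIV_CASE(div_case_0000001, "r")  /* divisor forced into a register */
DEFINE_DIV_CASE(div_case_0000002, "m")  /* divisor forced into memory */
```

Both cases should give the same answer, e.g. quotient 14 and remainder 2 for 100 divided by 7. An "i" case would need a compile-time constant operand and, for divq, is expected to fail at assembly time, which is the kind of error the proposed mode is meant to surface. This only covers one operand at a time; it does not explore register allocation choices the way the proposed option would.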
Just a thought,
Darryl