Rask Ingemann Lambertsen wrote:
On Tue, Aug 28, 2007 at 11:02:49PM +0100, Darryl Miles wrote:
Peephole definitions check for cases like this and won't do the
optimization clobbering the flags register if the flags register is live at
that point.
So I take it the peephole pass works by knowing the instructions emitted,
with annotations about the lifetimes of registers / flags / other useful
stuff to help it. I was thinking it was rather more blind to things than
that.
If this is the case, then that leads me to believe that setting %edx to 0
should leave a lot of options open for achieving that goal.
0000000000000090 <u64_divide>:
  00: 49 89 d1   mov %rdx,%r9     <<- [1] save %rdx in %r9 for arg-as-return
  03: 48 8b 07   mov (%rdi),%rax
  06: ?? ?? ??   xor %edx,%edx    <<- implicit zero of high 32bits, would accept xorq %rdx,%rdx
Right, that's why I suggest using "gcc -S -dp" because then it clearly
shows if it's a 32-bit (*movsi_xxx) or a 64-bit (*movdi_xxx) instruction (as
seen from GCC's point of view, since the actual CPU instruction is the same
in this and several other cases).
The code you are quoting is not generated by GCC but is my ideal
expectation of GCC; see the original u64_divide.c source comment for what
GCC emits. At the end of this email is the "-O6 -S -dp" version from GCC
4.0.2.
I did not understand the relevance of knowing whether it is (*movsi_xxx)
or (*movdi_xxx). From my point of view, knowing that would not alter the
two original points I was making, [1] and [3]. Maybe there is some
pipelining (or other complex) issue I don't know about which makes the
emitted code better than what I'm suggesting.
Interestingly, if I change the order of my input params I get different
code, but the same two suboptimal situations exist.
0b: 48 f7 36 divq (%rsi)
0e: 73 02 jae 12 <u64_divide+0x12>
10: ?? ?? inc %r8d
Can't you substitute the "jae; inc %r8d" sequence with "adcl $0, %r8d"?
That's a possibility, but the u64_divide function is not actually
functional (it can't deal with 64-bit divisors; but that's beside the
point I was highlighting). Another reason why it's not functional is that
the processor flags on i386 are undefined after a DIV instruction anyway.
The carry check code was actually hijacked from my
uint32_nowrap_add(u_int32_t *dest, u_int32_t addvalue) function, which
does want to know about carry for overflow purposes, and your suggestion
looks good.
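A minimal sketch of that "adcl $0" idea applied to the
uint32_nowrap_add() use-case mentioned above. The constraints, the
extra carry-counter parameter and the function name here are my own
assumptions for illustration, not the original code:

```c
#include <stdint.h>

/* Sketch only: add 'addvalue' to *dest and, instead of the "jae; inc"
 * branch, fold the carry flag straight into 'carries' with a single
 * ADC.  The "cc" clobber tells GCC the flags register is modified. */
static uint32_t add_count_carry(uint32_t *dest, uint32_t addvalue,
                                uint32_t carries)
{
    __asm__ ("addl %2, %0\n\t"  /* *dest += addvalue, sets CF on wrap */
             "adcl $0, %1"      /* carries += CF, no branch needed    */
             : "+rm" (*dest), "+r" (carries)
             : "ri" (addvalue)
             : "cc");
    return carries;
}
```

Whether this beats the branch in practice would need measuring, but it
removes the jump from the instruction stream entirely.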
You can use "rm" for such a constraint.
Tested and working. Thanks.
Another concern that occurs to me: if the __asm__ constraints are not
100% perfect, is there any way to test/permute every possible way the
compiler might generate the code?
I suppose you could write a script which outputs "calls" to the asm
construct with a constant, local variable (which we assume will end up in a
register) or global variable for each operand in turn, then try compiling
and assembling (i.e. -c) the resulting code.
My thinking was for GCC to facilitate some sort of automated testing,
which would then help everyone on every platform, especially if I were to
then try to create inline-able versions of functions using __asm__.
I would imagine that with the generated-symbol approach you could easily
make a DLL with many versions within it, load it into a test-harness
program, look up the symbols, and execute every permutation of the
function and verify the result. Couple this with, say, valgrind and it
may even be possible to verify exactly what memory is read/written.
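As a rough illustration of the harness half of that idea (the DLL
symbol lookup and the valgrind part omitted), here is a sketch that runs
a table of stand-in variants against a reference implementation; all the
names here are hypothetical, and the stand-ins take the place of the
permuted-constraint versions the generator would emit:

```c
#include <stdint.h>
#include <stddef.h>

/* Reference implementation the generated variants must agree with. */
static uint32_t ref_add(uint32_t a, uint32_t b) { return a + b; }

/* Two stand-in "variants"; real ones would come from permuting the
 * __asm__ constraint combinations and would be resolved by symbol. */
static uint32_t variant_a(uint32_t a, uint32_t b) { return a + b; }
static uint32_t variant_b(uint32_t a, uint32_t b) { return b + a; }

/* Run every variant over every test case; return how many variants
 * matched the reference on all cases. */
static size_t count_matching_variants(void)
{
    uint32_t (*variants[])(uint32_t, uint32_t) = { variant_a, variant_b };
    static const uint32_t cases[][2] = { {0, 0}, {1, 2}, {0xFFFFFFFFu, 1} };
    size_t ok = 0;
    for (size_t v = 0; v < sizeof variants / sizeof variants[0]; v++) {
        size_t pass = 1;
        for (size_t c = 0; c < sizeof cases / sizeof cases[0]; c++)
            if (variants[v](cases[c][0], cases[c][1])
                    != ref_add(cases[c][0], cases[c][1]))
                pass = 0;
        ok += pass;
    }
    return ok;
}
```

A real harness would fill the variant table via dlopen()/dlsym() on the
generated DLL rather than linking the variants in statically.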
All this would add confidence and eliminate a whole lot of possible
uncertainty.
xorl %r8d, %r8d # 44 *movdi_xor_rex64 [length = 3]
movq %rdx, %r9 # 8 *movdi_1_rex64/2 [length = 6]
pushq %rbx # 38 *pushdi2_rex64/1 [length = 1]
.LCFI0:
movq (%rdi), %rax # 16 *movdi_1_rex64/2 [length = 6]
movl %r8d, %edx # 37 *movsi_1/1 [length = 3]
#APP
xorl %ebx,%ebx
divq (%rsi)
jnc 1f
incl %ebx
1:
movq %rax,(%r9)
movq %rdx,(%rcx)
#NO_APP
movl %ebx, %eax # 36 *movsi_1/1 [length = 2]
popq %rbx # 41 popdi1 [length = 1]
ret # 42 return_internal [length = 1]
Recapping on the original issues:
[1] Failure to treat setting a register to the value of zero as a
special case. Since there may be many ways to achieve this on a given
CPU, and different methods have different trade-offs (insn length,
unwanted side effects), this operation could be allowed a lot of freedom
for moving / scheduling.
[3] Usage of %ebx when %r8d would have been a better choice: at the time
%ebx needed to be allocated, the lifetime of the temporary use of %r8d
was already over. I.e. allocation of registers which form outputs but
not inputs should take place last (at the moment of #APP); maybe by
doing this %r8d would have been a candidate, which would negate the
need for the push/pops.
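For what it's worth, the implicit zero-extension that point [1] leans on
(a 32-bit write clearing the upper 32 bits) can be demonstrated with a
small sketch; the %k operand modifier and the function name are my own,
used here only for illustration:

```c
#include <stdint.h>

/* Sketch: on x86-64, writing a 32-bit register clears the upper
 * 32 bits, so the shorter 32-bit "xorl" zeroes the whole 64-bit
 * register.  "%k0" asks GCC for the 32-bit name of operand 0
 * (e.g. %eax when operand 0 is in %rax). */
static uint64_t zero_via_xorl(uint64_t seed)
{
    uint64_t out = seed;
    __asm__ ("xorl %k0, %k0" : "+r" (out) : : "cc");
    return out;
}
```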
Thanks for your thoughts. Maybe I am just expecting too much.
Darryl