Hello, I am learning about the tradeoffs between using _mm intrinsics and
inline asm.
Briefly, I have noticed that at least one intrinsic does not support a
valid assembly addressing mode. I would be grateful if anyone can tell me
why this particular problem occurs, or if there is a better way to
approach similar situations. In general I'm trying to figure out what can
and cannot be done via the Intel-style _mm intrinsics. I'm fairly new to
intrinsics, inline asm, and SSE ops. So far, I prefer intrinsics, where
available, since they let gcc schedule the ops. But I like control too. :)
PROBLEM: (Sample code, compiler info, and objdumps are at the end.)
The intrinsic for the SSE2 shift instruction psllq uses an extra register
in a situation that does not require two registers. The corresponding
inline asm compiles as intended, using a single register.
There is an addressing mode where the psllq shift count can be taken from
an xmm register. It _is_ valid for both operand registers to be the same
physical register, as in: psllq %xmm0,%xmm0. In this form, the shift count
is taken from the low quadword of the source operand, and both quadwords
of the destination are then shifted left by that many bits.
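To make those semantics concrete, here is a small self-contained sketch
(mine, separate from the cases below) that shifts a vector left by its own
low quadword; the expected output is noted in the comments:

#include <emmintrin.h>
#include <stdio.h>

int main (void) {
    /* quadwords are 2 (high) and 3 (low); the count is the low quadword */
    __m128i v = _mm_set_epi32(0, 2, 0, 3);
    __m128i r = _mm_sll_epi64(v, v);        /* both quadwords shifted by 3 */
    unsigned long long out[2];
    _mm_storeu_si128((__m128i *) out, r);
    printf("%llu %llu\n", out[0], out[1]);  /* should print: 24 16 */
    return 0;
}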
The corresponding intrinsic is _mm_sll_epi64. I have been successful in
getting _mm_sll_epi64(xmmreg0, xmmreg1) to compile to: psllq %xmm1,%xmm0.
But I have been unable to coax gcc into compiling _mm_sll_epi64(xmmreg0,
xmmreg0) to: psllq %xmm0,%xmm0, where both operands are the same physical
register.
When coded using _mm_sll_epi64(xmmreg0, xmmreg0), gcc always allocates a
second intermediate register, copies the operand into it, and then emits:
psllq %xmm1,%xmm0.
Here, "xmmreg0" and "xmmreg1" refer to local "register" bound variables to
make the description cleaner. But the problem does not depend on this. See
CASE B and CASE C below.
Keeping both operands in the same register could be useful when registers
are scarce. However, coding this op with the gcc intrinsic causes
additional register pressure. (My initial interest in all this came from
wanting to emulate a full 128-bit shifter as efficiently as possible.
Unfortunately the 128-bit SSE shift ops have byte granularity, so quadword
shift ops have to be stitched together for most cases; see the sketch
below.)
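For reference, this is roughly the kind of stitched-together shifter I
mean. It is a sketch only, with illustrative names (shift128_left, n),
and it assumes 0 <= n < 128:

#include <emmintrin.h>

static __m128i shift128_left (__m128i v, int n) {
    __m128i lo, carry;
    if (n >= 64) {
        /* the low quadword moves entirely into the high quadword */
        lo = _mm_slli_si128(v, 8);   /* byte-granular whole-register shift */
        return _mm_sll_epi64(lo, _mm_cvtsi32_si128(n - 64));
    }
    lo    = _mm_sll_epi64(v, _mm_cvtsi32_si128(n));
    /* bits that cross the 64-bit seam; for n == 0 the count is 64, which
       the SSE shifts treat as "shift everything out", giving zero */
    carry = _mm_srl_epi64(v, _mm_cvtsi32_si128(64 - n));
    carry = _mm_slli_si128(carry, 8);  /* move low-qw carry into the high qw */
    return _mm_or_si128(lo, carry);
}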
EXAMPLES: (code, context, and objdumps are at the end.)
CASE A is the best I've done so far. Using extended inline asm, gcc/gas
have no problem with the overlapping operand registers.
CASE B is the direct mapping of CASE A into _mm intrinsic syntax (to the
best of my ability). You can see the extra setup code emitted for the
duplicate register.
CASE C is the same as CASE B, except that I tried to pressure gcc into
using specific xmm registers by creating intermediate local "register"
variables. The result is identical to CASE B.
There may be similar situations where the intrinsics don't compile to the
minimal-register addressing mode when the operands overlap. I've only
tinkered with psllq so far, and I have not tried this with gcc 4.0 yet.
Thanks in advance for any comments!
-brice
----COMMON CONTEXT---------------------------------------------------------
# gcc flags set from within the perl Inline::C module
use Inline C => Config =>
    CC       => '/usr/bin/gcc-3.4',
    CCFLAGS  => '-march=i386 -mmmx -msse -msse2 -msse3',
    OPTIMIZE => '-O2';
# --version: gcc-3.4 (GCC) 3.4.4 20050314 (prerelease) (Debian 3.4.3-12)
/* C header stuff common to all examples */
#include <mmintrin.h>
#include <xmmintrin.h>
#include <emmintrin.h>
#include <pmmintrin.h>
typedef union {
    __m128i full128b;
    unsigned int words32b[4];
} reg128b;

reg128b vec; // global
----CASE A ----------------------------------------------------------------
void dommstuff () {
    /* the "0" matching constraint ties the input to the same register as
       the output, so both psllq operands are one physical register */
    __asm__ __volatile__ ( "psllq %0, %0"
                           : "=x" (vec.full128b)
                           : "0" (vec.full128b) );
}
00001298 <dommstuff>:
1298: e8 00 00 00 00 call 129d <dommstuff+0x5>
129d: 59 pop %ecx
129e: 81 c1 ff 27 00 00 add $0x27ff,%ecx
12a4: 55 push %ebp
12a5: 8b 81 84 00 00 00 mov 0x84(%ecx),%eax
12ab: 89 e5 mov %esp,%ebp
12ad: 66 0f 6f 00 movdqa (%eax),%xmm0
12b1: 66 0f f3 c0 >>>>>>>>>>> psllq %xmm0,%xmm0 <<<<<<<<<<<
12b5: 66 0f 7f 00 movdqa %xmm0,(%eax)
12b9: c9 leave
12ba: c3 ret
12bb: 90 nop
----CASE B ----------------------------------------------------------------
void dommstuff () {
    vec.full128b = _mm_sll_epi64(vec.full128b, vec.full128b);
}
00001298 <dommstuff>:
1298: 55 push %ebp
1299: 89 e5 mov %esp,%ebp
129b: e8 00 00 00 00 call 12a0 <dommstuff+0x8>
12a0: 59 pop %ecx
12a1: 81 c1 fc 27 00 00 add $0x27fc,%ecx
12a7: 83 ec 18 sub $0x18,%esp
12aa: 8b 81 84 00 00 00 mov 0x84(%ecx),%eax
12b0: 66 0f 6f 00 movdqa (%eax),%xmm0
12b4: 66 0f 7f 45 e8 movdqa %xmm0,0xffffffe8(%ebp)
12b9: 66 0f 6e 4d e8 movd 0xffffffe8(%ebp),%xmm1
12be: 66 0f f3 c1 >>>>>>>>>>> psllq %xmm1,%xmm0 <<<<<<<<<<<
12c2: 66 0f 7f 00 movdqa %xmm0,(%eax)
12c6: c9 leave
12c7: c3 ret
----CASE C ----------------------------------------------------------------
void dommstuff () {
    register __m128i xmmreg0 asm ("%xmm0");
    xmmreg0 = vec.full128b;
    vec.full128b = _mm_sll_epi64(xmmreg0, xmmreg0);
}
00001298 <dommstuff>:
1298: 55 push %ebp
1299: 89 e5 mov %esp,%ebp
129b: e8 00 00 00 00 call 12a0 <dommstuff+0x8>
12a0: 59 pop %ecx
12a1: 81 c1 fc 27 00 00 add $0x27fc,%ecx
12a7: 83 ec 18 sub $0x18,%esp
12aa: 8b 81 84 00 00 00 mov 0x84(%ecx),%eax
12b0: 66 0f 6f 00 movdqa (%eax),%xmm0
12b4: 66 0f 7f 45 e8 movdqa %xmm0,0xffffffe8(%ebp)
12b9: 66 0f 6e 4d e8 movd 0xffffffe8(%ebp),%xmm1
12be: 66 0f f3 c1 >>>>>>>>>>> psllq %xmm1,%xmm0 <<<<<<<<<<<
12c2: 66 0f 7f 00 movdqa %xmm0,(%eax)
12c6: c9 leave
12c7: c3 ret
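One aside, in case it helps anyone with similar register-pressure worries:
when the shift count is a compile-time constant, the immediate-count
intrinsic _mm_slli_epi64 avoids the count register entirely (the function
name below is just illustrative):

void dommstuff_imm (void) {
    /* immediate form: should compile to psllq $0x3,%xmm0 -- no second register */
    vec.full128b = _mm_slli_epi64(vec.full128b, 3);
}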