Hello, I am learning about the tradeoffs between using _mm intrinsics and
inline asm.
Briefly, I have noticed that at least one intrinsic does not support a
valid assembly addressing mode. I would be grateful if anyone can tell me
why this particular problem occurs, or if there is a better way to
approach similar situations. In general I'm trying to figure out what can
and cannot be done via the Intel-style _mm intrinsics. I'm fairly new to
intrinsics, inline asm, and SSE ops. So far, I prefer intrinsics, where
available, since they let gcc schedule the ops. But I like control too. :)
PROBLEM: (Sample code, compiler info, and objdumps are at the end.)
The intrinsic for the SSE2 shift instruction psllq uses an extra register
in a situation that does not require two registers. The corresponding
inline asm compiles as intended, using a single register.
There is an addressing mode where the psllq shift count can be taken from
an xmm register. It _is_ valid for both operand registers to be the same
physical register, as in: psllq %xmm0,%xmm0. In this form, the shift count
is taken from the low quadword of the source operand, and both quadwords
of the destination are then shifted left by that many bits.
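To make those semantics concrete, here is a small self-contained sketch
(mine, separate from the cases below) that shifts a vector left by its own
low quadword; the expected output is noted in the comments:

#include <emmintrin.h>
#include <stdio.h>

int main (void) {
    /* quadwords are 2 (high) and 3 (low); the count is the low quadword */
    __m128i v = _mm_set_epi32(0, 2, 0, 3);
    __m128i r = _mm_sll_epi64(v, v);        /* both quadwords shifted by 3 */
    unsigned long long out[2];
    _mm_storeu_si128((__m128i *) out, r);
    printf("%llu %llu\n", out[0], out[1]);  /* should print: 24 16 */
    return 0;
}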
The corresponding intrinsic is _mm_sll_epi64. I have been successful in
getting _mm_sll_epi64(xmmreg0, xmmreg1) to compile to: psllq %xmm1,%xmm0.
But I have been unable to coax gcc into compiling _mm_sll_epi64(xmmreg0,
xmmreg0) to: psllq %xmm0,%xmm0, where both operands are the same physical
register.
When coded using _mm_sll_epi64(xmmreg0, xmmreg0), gcc always allocates a
second intermediate register, copies the operand into it, and then emits:
psllq %xmm1,%xmm0.
Here, "xmmreg0" and "xmmreg1" refer to local "register" bound variables to
make the description cleaner. But the problem does not depend on this. See
CASE B and CASE C below.
Keeping both operands in the same register could be useful when registers
are scarce. However, coding this op with the gcc intrinsic causes
additional register pressure. (My initial interest in all this came from
wanting to emulate a full 128-bit shifter as efficiently as possible.
Unfortunately the 128-bit SSE shift ops have byte granularity, so quadword
shift ops have to be stitched together for most cases; see the sketch
below.)
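For reference, this is roughly the kind of stitched-together shifter I
mean. It is a sketch only, with illustrative names (shift128_left, n),
and it assumes 0 <= n < 128:

#include <emmintrin.h>

static __m128i shift128_left (__m128i v, int n) {
    __m128i lo, carry;
    if (n >= 64) {
        /* the low quadword moves entirely into the high quadword */
        lo = _mm_slli_si128(v, 8);   /* byte-granular whole-register shift */
        return _mm_sll_epi64(lo, _mm_cvtsi32_si128(n - 64));
    }
    lo    = _mm_sll_epi64(v, _mm_cvtsi32_si128(n));
    /* bits that cross the 64-bit seam; for n == 0 the count is 64, which
       the SSE shifts treat as "shift everything out", giving zero */
    carry = _mm_srl_epi64(v, _mm_cvtsi32_si128(64 - n));
    carry = _mm_slli_si128(carry, 8);  /* move low-qw carry into the high qw */
    return _mm_or_si128(lo, carry);
}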
EXAMPLES: (code, context, and objdumps are at the end.)
CASE A is the best I've done so far. Using extended inline asm, gcc/gas
have no problem with the overlapping operand registers.
CASE B is the direct mapping of CASE A into _mm intrinsic syntax (to the
best of my ability). You can see the extra setup code emitted for the
duplicate register.
CASE C is the same as CASE B, except that I tried to pressure gcc into
using specific xmm registers by creating intermediate local "register"
variables. The result is identical to CASE B.
There may be similar situations where the intrinsics don't compile to the
minimal-register addressing mode when the operands overlap. I've only
tinkered with psllq so far, and I have not tried this with gcc 4.0 yet.
Thanks in advance for any comments!
-brice
----COMMON CONTEXT---------------------------------------------------------
# gcc flags set from within the perl Inline::C module
use Inline C => Config =>
    CC       => '/usr/bin/gcc-3.4',
    CCFLAGS  => '-march=i386 -mmmx -msse -msse2 -msse3',
    OPTIMIZE => '-O2';
# --version: gcc-3.4 (GCC) 3.4.4 20050314 (prerelease) (Debian 3.4.3-12)
/* C header stuff common to all examples */
#include <mmintrin.h>
#include <xmmintrin.h>
#include <emmintrin.h>
#include <pmmintrin.h>
typedef union {
    __m128i full128b;
    unsigned int words32b[4];
} reg128b;

reg128b vec; // global
----CASE A ----------------------------------------------------------------
void dommstuff () {
    /* the "0" matching constraint ties the input to the same register as
       the output, so both psllq operands are one physical register */
    __asm__ __volatile__ ( "psllq %0, %0"
                           : "=x" (vec.full128b)
                           : "0" (vec.full128b) );
}
00001298 <dommstuff>:
1298: e8 00 00 00 00 call 129d <dommstuff+0x5>
129d: 59 pop %ecx
129e: 81 c1 ff 27 00 00 add $0x27ff,%ecx
12a4: 55 push %ebp
12a5: 8b 81 84 00 00 00 mov 0x84(%ecx),%eax
12ab: 89 e5 mov %esp,%ebp
12ad: 66 0f 6f 00 movdqa (%eax),%xmm0
12b1: 66 0f f3 c0 >>>>>>>>>>> psllq %xmm0,%xmm0 <<<<<<<<<<<
12b5: 66 0f 7f 00 movdqa %xmm0,(%eax)
12b9: c9 leave
12ba: c3 ret
12bb: 90 nop
----CASE B ----------------------------------------------------------------
void dommstuff () {
    vec.full128b = _mm_sll_epi64(vec.full128b, vec.full128b);
}
00001298 <dommstuff>:
1298: 55 push %ebp
1299: 89 e5 mov %esp,%ebp
129b: e8 00 00 00 00 call 12a0 <dommstuff+0x8>
12a0: 59 pop %ecx
12a1: 81 c1 fc 27 00 00 add $0x27fc,%ecx
12a7: 83 ec 18 sub $0x18,%esp
12aa: 8b 81 84 00 00 00 mov 0x84(%ecx),%eax
12b0: 66 0f 6f 00 movdqa (%eax),%xmm0
12b4: 66 0f 7f 45 e8 movdqa %xmm0,0xffffffe8(%ebp)
12b9: 66 0f 6e 4d e8 movd 0xffffffe8(%ebp),%xmm1
12be: 66 0f f3 c1 >>>>>>>>>>> psllq %xmm1,%xmm0 <<<<<<<<<<<
12c2: 66 0f 7f 00 movdqa %xmm0,(%eax)
12c6: c9 leave
12c7: c3 ret
----CASE C ----------------------------------------------------------------
void dommstuff () {
    register __m128i xmmreg0 asm ("%xmm0");
    xmmreg0 = vec.full128b;
    vec.full128b = _mm_sll_epi64(xmmreg0, xmmreg0);
}
00001298 <dommstuff>:
1298: 55 push %ebp
1299: 89 e5 mov %esp,%ebp
129b: e8 00 00 00 00 call 12a0 <dommstuff+0x8>
12a0: 59 pop %ecx
12a1: 81 c1 fc 27 00 00 add $0x27fc,%ecx
12a7: 83 ec 18 sub $0x18,%esp
12aa: 8b 81 84 00 00 00 mov 0x84(%ecx),%eax
12b0: 66 0f 6f 00 movdqa (%eax),%xmm0
12b4: 66 0f 7f 45 e8 movdqa %xmm0,0xffffffe8(%ebp)
12b9: 66 0f 6e 4d e8 movd 0xffffffe8(%ebp),%xmm1
12be: 66 0f f3 c1 >>>>>>>>>>> psllq %xmm1,%xmm0 <<<<<<<<<<<
12c2: 66 0f 7f 00 movdqa %xmm0,(%eax)
12c6: c9 leave
12c7: c3 ret
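One aside, in case it helps anyone with similar register-pressure worries:
when the shift count is a compile-time constant, the immediate-count
intrinsic _mm_slli_epi64 avoids the count register entirely (the function
name below is just illustrative):

void dommstuff_imm (void) {
    /* immediate form: should compile to psllq $0x3,%xmm0 -- no second register */
    vec.full128b = _mm_slli_epi64(vec.full128b, 3);
}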