Inline Asm vs. MM Intrinsic: register conflict for a valid addressing mode

Hello, I am learning about the tradeoffs between using _mm intrinsics and inline asm.

Briefly, I have noticed that at least one intrinsic does not support a valid assembly addressing mode. I would be grateful if anyone can tell me why this particular problem occurs, or if there is a better way to approach similar situations. In general I'm trying to figure out what can and cannot be done via the intel-style _mm intrinsics. I'm fairly new to the intrinsics, inline asm, and sse ops. So far, I prefer intrinsics, if available, since they allow gcc to schedule the ops. But I like control too. :)

PROBLEM: (Sample code, compiler info, and objdumps are at the end.)

The intrinsic for the sse2 shift instruction psllq uses an extra register in a situation that does not require two registers. The corresponding inline asm compiles as intended, using a single register.

There is an addressing mode where the psllq shift amount can be taken from an xmm register. It _is_ valid for both operand registers to be the same physical register, as in: psllq %xmm0,%xmm0. In this case, the shift amount is taken from the low quadword (a count greater than 63 zeroes the result), and then both quadwords are shifted by that many bits.
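
To make the overlapping-operand behaviour concrete, here is a minimal scalar sketch in plain C. It is only a model of what the instruction does to the two 64-bit lanes, not how gcc or the hardware implements it; the name psllq_model is made up for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical scalar model of "psllq src, dst" on two 64-bit lanes.
   src and dst may point at the same array, which models psllq %xmm0,%xmm0. */
static void psllq_model(uint64_t dst[2], const uint64_t src[2])
{
    uint64_t count;

    assert(dst != NULL && src != NULL);
    count = src[0];             /* read the count before dst is overwritten */
    if (count > 63) {           /* counts above 63 clear both lanes */
        dst[0] = 0;
        dst[1] = 0;
        return;
    }
    dst[0] <<= count;           /* both lanes shift by the same amount */
    dst[1] <<= count;
}
```

Calling psllq_model(a, a) with a = {3, 1} leaves a = {24, 8}: the low lane supplies the count of 3 and is itself shifted by it.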

The corresponding intrinsic is _mm_sll_epi64. I have been successful in getting _mm_sll_epi64(xmmreg0, xmmreg1) to compile to: psllq %xmm0,%xmm1. But I have been unable to coax gcc into compiling _mm_sll_epi64(xmmreg0, xmmreg0) to: psllq %xmm0,%xmm0 where both operands are the same physical register.

When coded using _mm_sll_epi64(xmmreg0, xmmreg0), gcc always allocates a second intermediate register and copies the operand into both. It then emits: psllq %xmm1,%xmm0.

Here, "xmmreg0" and "xmmreg1" refer to local "register" bound variables to make the description cleaner. But the problem does not depend on this. See CASE B and CASE C below.

Having both operands within the same register could be useful when registers are scarce. However, coding this op with the gcc intrinsic causes additional register pressure. (My initial interest in all this came from wanting to emulate a full 128 bit shifter as efficiently as possible. Unfortunately the 128b sse shift ops have byte granularity so quadword shift ops have to be stitched together for most cases.)
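
For context, the stitching mentioned above could look roughly like the following. This is a sketch only, assuming an SSE2 target and a compiler that provides emmintrin.h; shl128 and lane64 are hypothetical names, and no claim is made that this is the most register-efficient formulation.

```c
#include <assert.h>
#include <emmintrin.h>
#include <stdint.h>

/* Hypothetical helper: shift a 128-bit value left by n bits, 0 <= n < 128. */
static __m128i shl128(__m128i v, unsigned n)
{
    __m128i moved, carry, shifted;

    assert(n < 128);
    /* Byte shift moves the low quadword into the high lane: [lo, hi] -> [0, lo]. */
    moved = _mm_slli_si128(v, 8);
    if (n >= 64)
        return _mm_sll_epi64(moved, _mm_cvtsi32_si128((int)(n - 64)));
    /* Bits that cross from the low lane into the high lane. For n == 0 the
       count is 64, and psrlq yields zero for counts above 63. */
    carry = _mm_srl_epi64(moved, _mm_cvtsi32_si128((int)(64 - n)));
    shifted = _mm_sll_epi64(v, _mm_cvtsi32_si128((int)n));
    return _mm_or_si128(shifted, carry);
}

/* Hypothetical helper: read back lane i (0 = low quadword, 1 = high quadword). */
static uint64_t lane64(__m128i v, int i)
{
    uint64_t out[2];
    _mm_storeu_si128((__m128i *)out, v);
    return out[i];
}
```

Each call needs the shifted value, the carry bits, and a count register live at once, which is where saving even one xmm register starts to matter.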

EXAMPLES: (code, context, and objdumps are at the end.)

CASE A is the best I've done so far. Using extended inline asm, gcc / gas has no problems with the overlapping operand registers.
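
One way to package the inline asm so that it can be mixed with intrinsic code is a small wrapper function. The sketch below assumes gcc-style extended asm on an SSE2 target. It uses the "+x" read-write constraint so that a single xmm register is guaranteed to serve as both input and output (a separate "=x" output and "x" input are not required to land in the same register). The names sll_by_own_low_qword and lane64 are hypothetical.

```c
#include <emmintrin.h>
#include <stdint.h>

/* Hypothetical wrapper: shift both quadwords of v left by the count held in
   v's own low quadword, using one xmm register for source and destination. */
static inline __m128i sll_by_own_low_qword(__m128i v)
{
    __asm__ ("psllq %0, %0" : "+x" (v));   /* "+x": operand is read and written */
    return v;
}

/* Hypothetical helper: read back lane i (0 = low quadword, 1 = high quadword). */
static uint64_t lane64(__m128i v, int i)
{
    uint64_t out[2];
    _mm_storeu_si128((__m128i *)out, v);
    return out[i];
}
```

The tradeoff is the usual one for inline asm: gcc treats the statement as opaque, so it can allocate the register but cannot schedule or combine the instruction the way it can with an intrinsic.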

CASE B is the direct mapping of CASE A into _mm intrinsic syntax (to the best of my abilities). You can see the extra setup code emitted for the duplicate register.

CASE C is the same as CASE B except I tried to pressure gcc into using specific xmm registers by creating intermediate local "register" variables. The result is identical to CASE B.

There may be similar situations where the intrinsics don't compile to the minimal-register addressing mode when the operands overlap. I've only tinkered with psllq so far. I have not tried this with gcc 4.0 yet.

Thanks in advance for any comments!
-brice



----COMMON CONTEXT---------------------------------------------------------

#gcc flags set from within perl Inline::C module
    use Inline C => Config =>
        CC => '/usr/bin/gcc-3.4',
        CCFLAGS => '-march=i386 -mmmx -msse -msse2 -msse3',
        OPTIMIZE => '-O2';
#--version = gcc-3.4 (GCC) 3.4.4 20050314 (prerelease) (Debian 3.4.3-12)

/* C header stuff common to all examples */
    #include <mmintrin.h>
    #include <xmmintrin.h>
    #include <emmintrin.h>
    #include <pmmintrin.h>

    typedef union {
        __m128i full128b;
        unsigned int words32b[4];
    } reg128b;

    reg128b vec; //global


----CASE A ----------------------------------------------------------------

void dommstuff () {
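/* "+x" marks one xmm register as both read and written, so %0 is
   guaranteed to be the same register on both sides of the psllq. */
__asm__ __volatile__ (  "psllq %0, %0"
						: "+x" (vec.full128b)     );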
}
00001298 <dommstuff>:
	1298:    e8 00 00 00 00           call   129d <dommstuff+0x5>
	129d:    59                       pop    %ecx
	129e:    81 c1 ff 27 00 00        add    $0x27ff,%ecx
	12a4:    55                       push   %ebp
	12a5:    8b 81 84 00 00 00        mov    0x84(%ecx),%eax
	12ab:    89 e5                    mov    %esp,%ebp
	12ad:    66 0f 6f 00              movdqa (%eax),%xmm0
	12b1:    66 0f f3 c0  >>>>>>>>>>> psllq  %xmm0,%xmm0 <<<<<<<<<<<
	12b5:    66 0f 7f 00              movdqa %xmm0,(%eax)
	12b9:    c9                       leave
	12ba:    c3                       ret
	12bb:    90                       nop

	
----CASE B ----------------------------------------------------------------

void dommstuff () {
	vec.full128b = _mm_sll_epi64(vec.full128b, vec.full128b);
}
00001298 <dommstuff>:
	1298:    55                       push   %ebp
	1299:    89 e5                    mov    %esp,%ebp
	129b:    e8 00 00 00 00           call   12a0 <dommstuff+0x8>
	12a0:    59                       pop    %ecx
	12a1:    81 c1 fc 27 00 00        add    $0x27fc,%ecx
	12a7:    83 ec 18                 sub    $0x18,%esp
	12aa:    8b 81 84 00 00 00        mov    0x84(%ecx),%eax
	12b0:    66 0f 6f 00              movdqa (%eax),%xmm0
	12b4:    66 0f 7f 45 e8           movdqa %xmm0,0xffffffe8(%ebp)
	12b9:    66 0f 6e 4d e8           movd   0xffffffe8(%ebp),%xmm1
	12be:    66 0f f3 c1  >>>>>>>>>>> psllq  %xmm1,%xmm0 <<<<<<<<<<<
	12c2:    66 0f 7f 00              movdqa %xmm0,(%eax)
	12c6:    c9                       leave
	12c7:    c3                       ret


----CASE C ----------------------------------------------------------------

void dommstuff () {
	register __m128i xmmreg0 asm ("%xmm0");
	xmmreg0 = vec.full128b;
	vec.full128b = _mm_sll_epi64(xmmreg0, xmmreg0);
}
00001298 <dommstuff>:
	1298:    55                       push   %ebp
	1299:    89 e5                    mov    %esp,%ebp
	129b:    e8 00 00 00 00           call   12a0 <dommstuff+0x8>
	12a0:    59                       pop    %ecx
	12a1:    81 c1 fc 27 00 00        add    $0x27fc,%ecx
	12a7:    83 ec 18                 sub    $0x18,%esp
	12aa:    8b 81 84 00 00 00        mov    0x84(%ecx),%eax
	12b0:    66 0f 6f 00              movdqa (%eax),%xmm0
	12b4:    66 0f 7f 45 e8           movdqa %xmm0,0xffffffe8(%ebp)
	12b9:    66 0f 6e 4d e8           movd   0xffffffe8(%ebp),%xmm1
	12be:    66 0f f3 c1  >>>>>>>>>>> psllq  %xmm1,%xmm0 <<<<<<<<<<<
	12c2:    66 0f 7f 00              movdqa %xmm0,(%eax)
	12c6:    c9                       leave
	12c7:    c3                       ret



