Hello,
I have some code written in SSE2 intrinsics that is compiled with GCC 4.1.0, and
I've been profiling it with Intel's VTune 8.0.
I'm unpacking some interleaved data into planar form, and due to the nature of
the packing I'm going through the MMX registers first, before moving into the
XMM registers.
At the point where I want to move my data from MMX to XMM registers, I'm calling
_mm_movpi64_epi64(). This ought to generate a MOVQ2DQ instruction, but instead
GCC is spilling the value from the MMX register to the stack and then loading it
back into an XMM register.
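For reference, here is a stripped-down sketch of the kind of code involved (the
function and variable names are just for illustration, but the intrinsic call is
the one in question):

#include <emmintrin.h>   /* SSE2 intrinsics; pulls in mmintrin.h for MMX types */

/* 'lo' holds 8 bytes of deinterleaved data sitting in an MMX register,
 * which I then want to continue working with in an XMM register. */
static __m128i widen(__m64 lo)
{
    /* I expect this to compile to a single movq2dq %mmN, %xmmN. */
    return _mm_movpi64_epi64(lo);
}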
The assembly generated is this:
mov    $0x0, -56(%ebp)
movq   %mm0, -88(%ebp)
movq   -88(%ebp), %xmm3
movhps -56(%ebp), %xmm3
I would have expected to see this:
movq2dq %mm0, %xmm3
The problem is that VTune reports that this store/load sequence blocks
store-forwarding and introduces a large stall. This is in the inner loop of some
image processing code.
There isn't much register pressure: my register usage is split roughly 50/50
between MMX and XMM, and I'm only using half of each register set.
Has anyone else seen similar behaviour? Is there something that is preventing
GCC from issuing the MOVQ2DQ? I'm building with -msse2.
--
Kind regards
James Milne