Hello,
I have some code written in SSE2 intrinsics that is compiled with GCC 4.1.0, and
I've been profiling it with Intel's VTune 8.0.
I'm unpacking some interleaved data into planar form, and due to the nature of
the packing I'm going through the MMX registers first, before moving into the
XMM registers.
At the point where I want to move my data from MMX to XMM registers, I'm calling
_mm_movpi64_epi64(). This ought to generate a MOVQ2DQ instruction, but instead
GCC is spilling the value from the MMX register to the stack and then loading it
back into an XMM register.
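For reference, here is a stripped-down sketch of the kind of code involved (the
function and variable names are just for illustration, but the intrinsic call is
the one in question):

#include <emmintrin.h>   /* SSE2 intrinsics; pulls in mmintrin.h for MMX types */

/* 'lo' holds 8 bytes of deinterleaved data sitting in an MMX register,
 * which I then want to continue working with in an XMM register. */
static __m128i widen(__m64 lo)
{
    /* I expect this to compile to a single movq2dq %mmN, %xmmN. */
    return _mm_movpi64_epi64(lo);
}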
The assembly generated is this:
mov    $0x0, -56(%ebp)
movq   %mm0, -88(%ebp)
movq   -88(%ebp), %xmm3
movhps -56(%ebp), %xmm3
I would have expected to see this:
movq2dq %mm0, %xmm3
The problem is that VTune reports that this store/load sequence blocks
store-forwarding and introduces a large stall. This is in the inner loop of some
image processing code.
There isn't much register pressure: my register usage is split roughly 50/50
between MMX and XMM, and I'm only using half of each register set.
Has anyone else seen similar behaviour? Is there something that is preventing
GCC from issuing the MOVQ2DQ? I'm building with -msse2.
--
Kind regards
James Milne