Gautam Sewani wrote:
Hi,
I am trying to vectorize a piece of code using SSE2 intrinsics (the
ones in emmintrin.h). I am using double-precision floating-point
arithmetic. The running times I obtained were very similar with and
without the vectorization. I suspect the reason for this is that in
the vectorized code, I am storing the contents of a packed xmm
register (represented by an __m128d variable) into a double array.
Looking into the assembly code generated, I saw that for this, the
contents of the xmm register were first saved to a memory location and
then loaded into the x87 FPU stack. Apparently there is no direct way
to transfer data between x87 and xmm registers. One way to eliminate
this would be to use xmm registers for all floating point
calculations. But in spite of using -march=prescott and -mfpmath=sse,
x87 instructions like fld and fstp are still used. Is there any way to
force GCC to use only the xmm registers for all floating-point
calculations? (I tried using the -mno-80387 option, but I am getting
lots of weird linker errors with that.) Or is there any way to move
data between x87 and xmm registers without using memory as an
intermediary?
Moves between x87 and xmm registers must always go through memory. I'm
not clear on why you want to use x87 registers in vectorized code, or
whether you really need intrinsics rather than auto-vectorization. With
gcc, if a loop mixes vectorizable and non-vectorizable code, you must
split it so that the loops you want sped up contain only vectorizable
code; otherwise you get no benefit from vectorization. gcc doesn't
"distribute" loops automatically for vectorization.
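
As a rough illustration of the loop splitting described above, here is a
minimal sketch in C. The function names (running_state, fused, split) and
the arithmetic are made up for the example and are not taken from the
original code. The first loop mixes independent element-wise work with a
loop-carried recurrence; the second version separates them so that the
element-wise loop is a candidate for auto-vectorization (for instance with
-O3 or -ftree-vectorize), assuming the arrays do not overlap.

```c
#include <stddef.h>

/* Hypothetical non-vectorizable step: each result depends on the
   previous one (a loop-carried dependence). */
static double running_state(double prev, double x)
{
    return 0.5 * prev + x;
}

/* Fused version: independent work and the recurrence share one loop,
   which may keep the compiler from vectorizing any of it. */
void fused(double *restrict out, double *restrict hist,
           const double *restrict a, const double *restrict b, size_t n)
{
    double state = 0.0;
    for (size_t i = 0; i < n; i++) {
        out[i] = a[i] * b[i] + a[i];           /* independent per element */
        state = running_state(state, out[i]);  /* depends on iteration i-1 */
        hist[i] = state;
    }
}

/* Split version: the first loop contains only vectorizable code;
   the recurrence is isolated in a second, scalar loop. */
void split(double *restrict out, double *restrict hist,
           const double *restrict a, const double *restrict b, size_t n)
{
    for (size_t i = 0; i < n; i++)
        out[i] = a[i] * b[i] + a[i];

    double state = 0.0;
    for (size_t i = 0; i < n; i++) {
        state = running_state(state, out[i]);
        hist[i] = state;
    }
}
```

Both versions compute the same values; whether the first loop of split()
is actually vectorized depends on the gcc version and flags, so it is
worth checking the generated assembly.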
gcc auto-vectorization has been improving in recent versions.
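
If intrinsics are kept, one way to avoid round trips between xmm and x87
registers is to do the scalar parts of the computation with SSE2 scalar
intrinsics as well, and to extract the final double directly from the xmm
register. The following is only a sketch under that assumption; dot_sse2
is a hypothetical function, not the original poster's code.

```c
#include <emmintrin.h>
#include <stddef.h>

/* Sum of x[i] * y[i], using only SSE2 packed and scalar operations. */
double dot_sse2(const double *x, const double *y, size_t n)
{
    __m128d acc = _mm_setzero_pd();
    size_t i = 0;

    /* Packed part: two doubles per iteration, unaligned loads. */
    for (; i + 2 <= n; i += 2) {
        __m128d vx = _mm_loadu_pd(x + i);
        __m128d vy = _mm_loadu_pd(y + i);
        acc = _mm_add_pd(acc, _mm_mul_pd(vx, vy));
    }

    /* Horizontal add: copy the high lane down and add it to the low lane. */
    __m128d hi = _mm_unpackhi_pd(acc, acc);
    __m128d sum = _mm_add_sd(acc, hi);

    /* Scalar tail, still in xmm registers. */
    for (; i < n; i++) {
        __m128d p = _mm_mul_sd(_mm_load_sd(x + i), _mm_load_sd(y + i));
        sum = _mm_add_sd(sum, p);
    }

    /* Read the low lane as a double without an explicit array store. */
    return _mm_cvtsd_f64(sum);
}
```

Note that on 32-bit x86 the usual calling convention returns a double in
the x87 register st(0), so a single store and fld may still appear at the
function return even when all arithmetic uses xmm registers.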