Hi,

I am trying to vectorize a piece of code using SSE2 intrinsics (the ones in emmintrin.h). I am using double-precision floating-point arithmetic. The running times I obtained were very similar with and without the vectorization.

I suspect the reason is that, in the vectorized code, I am storing the contents of a packed xmm register (represented by an __m128d variable) into a double array. Looking at the generated assembly, I saw that the contents of the xmm register were first saved to a memory location and then loaded onto the x87 FPU stack. Apparently there is no direct way to transfer data between x87 and xmm registers.

One way to eliminate this would be to use xmm registers for all floating-point calculations. But in spite of using -march=prescott and -mfpmath=sse, x87 instructions like fld and fstp are still used.

Is there any way to force GCC to use only the xmm registers for all floating-point calculations? (I tried the -mno-80387 option, but I get lots of weird linker errors with it.) Or is there any way to move data between x87 and xmm registers without using memory as an intermediary?

Regards,
Gautam
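
A minimal sketch of the pattern described above, not the actual code from this post. The function names sum_via_array and sum_in_register are hypothetical, and the sketch assumes the element count is even. The first function stores the packed register into a double array before finishing in scalar code; the second keeps the final addition in an xmm register, which may make the generated assembly easier to compare.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */

/* Hypothetical example: sum n doubles two at a time (n assumed even),
 * then store the packed accumulator into a double array and finish
 * the reduction in scalar code. */
static double sum_via_array(const double *a, int n)
{
    __m128d acc = _mm_setzero_pd();
    for (int i = 0; i < n; i += 2)
        acc = _mm_add_pd(acc, _mm_loadu_pd(a + i));  /* unaligned load */

    double tmp[2];
    _mm_storeu_pd(tmp, acc);   /* packed xmm register -> double array */
    return tmp[0] + tmp[1];    /* scalar add on the stored values */
}

/* Hypothetical variant: same loop, but the two lanes are added inside
 * an xmm register and only the low lane is extracted at the end. */
static double sum_in_register(const double *a, int n)
{
    __m128d acc = _mm_setzero_pd();
    for (int i = 0; i < n; i += 2)
        acc = _mm_add_pd(acc, _mm_loadu_pd(a + i));

    __m128d hi = _mm_unpackhi_pd(acc, acc);  /* both lanes = high lane of acc */
    __m128d s  = _mm_add_sd(acc, hi);        /* low lane = low + high */
    return _mm_cvtsd_f64(s);                 /* extract the low lane */
}
```

Compiling this with the options mentioned above plus -S (for example gcc -O2 -march=prescott -mfpmath=sse -S) should show which of the two forms, if either, still produces fld/fstp with a given compiler version.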