As I mentioned, I am using intrinsics (Intel SSE2 intrinsics from the
emmintrin.h file, to be specific). I do not wish to transfer data between
x87 and xmm registers. However, when I move a __m128d variable (a data type
for use with the SSE2 intrinsics) to a 2-element double array (to perform
some calculation on each double individually), gcc generates x87 FPU code
for that. I do not want to use the x87 FPU at all because, as you said,
there is no way of moving data between x87 and XMM registers without going
through memory. Therefore I want to know a method, compiler switch, etc.
which will cause gcc to *not* generate x87 FPU code.

On Thu, Jun 5, 2008 at 3:11 AM, Tim Prince <TimothyPrince@xxxxxxxxxxxxx> wrote:
> Gautam Sewani wrote:
>>
>> Hi,
>> I am trying to vectorize a piece of code using SSE2 intrinsics (the
>> ones in emmintrin.h). I am using double precision floating point
>> arithmetic. The running times I obtained were very similar with and
>> without the vectorization. I suspect the reason for this is that in
>> the vectorized code, I am storing the contents of a packed xmm
>> register (represented by an __m128d variable) into a double array.
>>
>> Looking into the assembly code generated, I saw that for this, the
>> contents of the xmm register were first saved to a memory location and
>> then loaded into the x87 FPU stack. Apparently there is no direct way
>> to transfer data between x87 and xmm registers. One way to eliminate
>> this would be to use xmm registers for all floating point
>> calculations. But in spite of using -march=prescott and -mfpmath=sse,
>> x87 instructions like fld and fstp are still used. Is there any way to
>> force GCC to use only the xmm registers for all floating point
>> calculations? (I tried using the -mno-80387 option but I am getting
>> lots of weird linker errors with that.) Or is there any way to move
>> data between x87 and xmm registers without using memory as an
>> intermediary?
>
> Moves between x87 and xmm registers must always go through memory. I'm not
> clear on why you want to use x87 registers in vectorized code, or whether
> you really need intrinsics rather than auto-vectorization. Using gcc, if
> you have a combination of vectorizable and non-vectorizable code, to get a
> benefit from vectorization, you must split your loops so that you have
> entirely vectorizable code in loops which you want speeded up. gcc doesn't
> "distribute" automatically for vectorization.
> gcc auto-vectorization has been improving in recent versions.
>
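
For reference, the per-element access described above (taking the two
doubles out of a __m128d) can be written so that the extraction itself
stays in SSE2. This is only a minimal sketch under assumptions, not a
guaranteed fix: the helper names lane_low, lane_high, store_lanes and
lanes_demo are made up for illustration, and whether gcc emits x87
instructions around such code still depends on the compiler flags and the
target ABI.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 intrinsics */

/* Hypothetical helper: low lane (element 0) as a scalar double. */
static inline double lane_low(__m128d v)
{
    return _mm_cvtsd_f64(v);
}

/* Hypothetical helper: high lane (element 1).
 * unpackhi(v, v) copies the high lane into the low lane first. */
static inline double lane_high(__m128d v)
{
    return _mm_cvtsd_f64(_mm_unpackhi_pd(v, v));
}

/* Hypothetical helper: store both lanes to a double array.
 * Uses the unaligned store, so any double[2] is acceptable. */
static inline void store_lanes(double out[2], __m128d v)
{
    _mm_storeu_pd(out, v);
}

/* Small self-check of the helpers; returns the sum of both lanes. */
double lanes_demo(void)
{
    __m128d v = _mm_set_pd(2.5, 1.5);  /* high lane = 2.5, low lane = 1.5 */
    double out[2];

    store_lanes(out, v);
    assert(out[0] == lane_low(v));   /* element 0 is the low lane */
    assert(out[1] == lane_high(v));  /* element 1 is the high lane */
    return out[0] + out[1];
}
```

This would be built with something like gcc -O2 -msse2 -mfpmath=sse. One
caveat: under the usual 32-bit x86 calling convention a double return value
is passed back in st(0), so an fld/fstp pair can still show up wherever a
function returning double is not inlined. Building for x86-64, where scalar
floating point uses SSE by default, avoids that case.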