That is very bad news indeed :-( . Can anyone confirm this with some
testing? (I am using a Core Duo, and don't have access to a Core 2 Duo.)

Regards
Gautam

On Thu, Jun 5, 2008 at 7:26 PM, Frédéric Bastien <nouiz@xxxxxxxxx> wrote:
> Hi,
>
> On Intel processors before the Core 2, there was a bottleneck in the
> CPU that caused every SSE instruction to be split in two. Since an SSE
> instruction holds only two doubles, on a processor with this
> bottleneck I see only one way to get a speedup: use float instead of
> double. I know this is not always an option. To my knowledge, Prescott
> CPUs have this bottleneck.
>
> Frederic Bastien
>
> On Thu, Jun 5, 2008 at 2:39 AM, Gautam Sewani <gautamcool88@xxxxxxxxx> wrote:
>> Hi,
>> I am trying to vectorize a piece of code using SSE2 intrinsics (the
>> ones in emmintrin.h). I am using double-precision floating-point
>> arithmetic. The running times I obtained were very similar with and
>> without the vectorization. I suspect the reason is that in the
>> vectorized code I am storing the contents of a packed xmm register
>> (represented by an __m128d variable) into a double array.
>>
>> Looking at the generated assembly, I saw that for this the contents
>> of the xmm register were first saved to a memory location and then
>> loaded onto the x87 FPU stack. Apparently there is no direct way to
>> transfer data between x87 and xmm registers. One way to eliminate
>> this would be to use xmm registers for all floating-point
>> calculations. But in spite of using -march=prescott and -mfpmath=sse,
>> x87 instructions like fld and fstp are still used. Is there any way
>> to force GCC to use only the xmm registers for all floating-point
>> calculations? (I tried the -mno-80387 option, but I get lots of weird
>> linker errors with it.) Or is there any way to move data between x87
>> and xmm registers without using memory as an intermediary?
>>
>> Regards
>> Gautam
>>
>
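
For reference, the pattern described in the original question (packed double arithmetic in an __m128d, then a store into a double array) can be sketched roughly as below. This is not the poster's code, which was not shown: the function name mul_add_pd and the arithmetic it performs are hypothetical and chosen only for illustration. The sketch assumes unaligned input arrays, so it uses the unaligned load/store intrinsics, and it falls back to plain scalar code when SSE2 is not enabled at compile time.

```c
#include <assert.h>
#include <stddef.h>

#if defined(__SSE2__)
#include <emmintrin.h>  /* SSE2 intrinsics: __m128d, _mm_loadu_pd, ... */
#endif

/* Hypothetical helper (not from the original post):
 * computes out[i] = a[i] * b[i] + a[i] for i in [0, n). */
void mul_add_pd(const double *a, const double *b, double *out, size_t n)
{
    size_t i = 0;

    assert(a != NULL && b != NULL && out != NULL);

#if defined(__SSE2__)
    /* Process two doubles per iteration, entirely in xmm registers. */
    for (; i + 2 <= n; i += 2) {
        __m128d va = _mm_loadu_pd(a + i);   /* unaligned load of a[i], a[i+1] */
        __m128d vb = _mm_loadu_pd(b + i);   /* unaligned load of b[i], b[i+1] */
        __m128d vr = _mm_add_pd(_mm_mul_pd(va, vb), va);
        _mm_storeu_pd(out + i, vr);         /* packed store into the array */
    }
#endif

    /* Scalar tail (or the whole loop if SSE2 is unavailable). */
    for (; i < n; ++i)
        out[i] = a[i] * b[i] + a[i];
}
```

Whether the scalar code that later reads the array uses xmm registers or the x87 stack depends on the target and flags. On 32-bit x86, GCC needs something like -msse2 -mfpmath=sse (or a -march value that implies SSE2) to prefer SSE for scalar math; on x86-64, SSE is already the default. Even then, the 32-bit calling convention returns floating-point values in st(0), which is one possible reason fld and fstp can still show up around function calls and returns.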