Hi,

I should say up front that I'm not an expert in SSE code; in fact, I've never written any. But I'm looking for ways to get the best speed out of my applications, since they take too long to execute, so I've been reading about many different technologies, and what I found about SSE made me think of this problem. There are subtleties, though: the difference is not only the number of concurrent operations, but also which operations are available. A more detailed theoretical explanation is that if you rewrite x87 code into the equivalent SSE instructions, you won't see a speedup; but if you can also take advantage of the new instructions, maybe you will. At least that is how I understand it right now. Just my thought, since you aren't the first to post that you get no speedup.

Frederic Bastien

On Fri, Jun 6, 2008 at 4:49 AM, Gautam Sewani <gautamcool88@xxxxxxxxx> wrote:
> That is very bad news indeed :-( .
> Can anyone confirm this with some testing? (I am using a Core Duo, and
> don't have access to a Core 2 Duo.)
> Regards
> Gautam
>
> On Thu, Jun 5, 2008 at 7:26 PM, Frédéric Bastien <nouiz@xxxxxxxxx> wrote:
>> Hi,
>>
>> With Intel processors before the Core 2, there was a bottleneck in the
>> CPU that caused every SSE instruction to be split in two. Since an SSE
>> instruction holds only two doubles, if you have a processor with
>> that bottleneck, I see only one way to get a speedup: use float
>> instead of double. I know this is not always an option. To my
>> knowledge, Prescott CPUs have this bottleneck.
>>
>> Frederic Bastien
>>
>> On Thu, Jun 5, 2008 at 2:39 AM, Gautam Sewani <gautamcool88@xxxxxxxxx> wrote:
>>> Hi,
>>> I am trying to vectorize a piece of code using SSE2 intrinsics (the
>>> ones in emmintrin.h). I am using double-precision floating-point
>>> arithmetic. The running times I obtained were very similar with and
>>> without the vectorization. I suspect the reason for this is that in
>>> the vectorized code, I am storing the contents of a packed xmm
>>> register (represented by an __m128d variable) into a double array.
>>>
>>> Looking at the generated assembly code, I saw that for this, the
>>> contents of the xmm register were first saved to a memory location and
>>> then loaded onto the x87 FPU stack. Apparently there is no direct way
>>> to transfer data between x87 and xmm registers. One way to eliminate
>>> this would be to use xmm registers for all floating-point
>>> calculations, but in spite of using -march=prescott and -mfpmath=sse,
>>> x87 instructions like fld and fstp are still generated. Is there any way to
>>> force GCC to use only the xmm registers for all floating-point
>>> calculations? (I tried the -mno-80387 option, but I got
>>> lots of weird linker errors with that.) Or is there any way to move
>>> data between x87 and xmm registers without using memory as an
>>> intermediary?
>>>
>>> Regards
>>> Gautam