I am new to the world of SSE, but in trying to speed up some C code I have run into a wall that is both perplexing and frustrating, since I can't find a solution. I am hoping someone here can provide the help I seek. Thank you for any assistance!

My code (a watered-down version) is as follows. It runs on a Pentium 4 based machine and is compiled with gcc 4.02 using the options -O3 -Wall -march=pentium4 -msse2 -mfpmath=sse:

    // standard C #include files are put here
    #include <emmintrin.h>   // I will eventually be using both sse2 and sse instructions
    #include <mm_malloc.h>

    void main()
    {
        float *ptr1,*ptr2,*ptr3,*tptr1,*tptr2;
        __m128 m1,m2,m3,*sptr1,*sptr2,*sptr3;
        int i,j,arraysize=1000,loopcount=10;

        // allocate space for dynamic arrays that are aligned to a 16-byte boundary
        // (note that arraysize will actually be read into this program in the final version)
        ptr1=(float *) _mm_malloc(arraysize*sizeof(float),16);
        ptr2=(float *) _mm_malloc(arraysize*sizeof(float),16);
        ptr3=(float *) _mm_malloc(arraysize*sizeof(float),16);
        tptr1=ptr1;
        tptr2=ptr2;

        // fill in two of the arrays with some numbers
        for(i=0;i<arraysize;i++,tptr1++,tptr2++)
        {
            *tptr1=(float)rand();
            *tptr2=(float)rand();
        }

        // TIMING LOOP STARTS
        for(i=0;i<loopcount;i++)
        {
            sptr1=(__m128 *) ptr1;   // cast to pointers to 128-bit values
            sptr2=(__m128 *) ptr2;
            sptr3=(__m128 *) ptr3;
            for(j=0;j<arraysize;j++,sptr1++,sptr2++,sptr3++)
            {
                m1=*sptr1;
                m2=*sptr2;
                m3=_mm_mul_ps(m1,m2);   // use SSE intrinsic instruction to multiply two numbers
                                        // (note that even if I use *sptrx instead of mx
                                        // I will get the same speed problem)
                *sptr3=m3;
            }
        }
        // TIMING LOOP ENDS HERE
    }

So my speed problem is as follows. Without the line "*sptr3=m3;" the TIMING LOOP works as expected, that is, four times faster than if I used normal float values instead of quad-sized float values (i.e. __m128). With the line "*sptr3=m3;" inside the TIMING LOOP, the code runs about 3 times slower than when using normal float values.
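For comparison, here is a minimal sketch of the same multiply loop in which each iteration handles one __m128, i.e. four floats, so that the inner loop runs arraysize/4 times, and in which the results are summed afterwards so that the stored values are actually used. It is an illustration under stated assumptions, not a confirmed diagnosis of the slowdown: the names mul_arrays_sse and run_timing_loop and the fill values are hypothetical and do not come from the original program, and the array length is assumed to be a multiple of 4.

```c
#include <assert.h>
#include <stddef.h>
#include <xmmintrin.h>   /* SSE intrinsics: __m128, _mm_load_ps, _mm_mul_ps, _mm_store_ps */
#include <mm_malloc.h>   /* _mm_malloc, _mm_free */

/* Hypothetical helper (not from the original post): dst[k] = a[k] * b[k].
   Assumes all pointers are 16-byte aligned and n is a multiple of 4. */
static void mul_arrays_sse(float *dst, const float *a, const float *b, size_t n)
{
    size_t j;

    assert(n % 4 == 0);
    /* One __m128 holds four floats, so step by 4: n/4 iterations, not n. */
    for (j = 0; j < n; j += 4) {
        __m128 m1 = _mm_load_ps(a + j);
        __m128 m2 = _mm_load_ps(b + j);
        _mm_store_ps(dst + j, _mm_mul_ps(m1, m2));
    }
}

/* Hypothetical driver: repeats the multiply loopcount times and returns the
   sum of the products, so the stores are observable by the caller.
   Returns 0 if an allocation fails. */
static float run_timing_loop(size_t n, int loopcount)
{
    float *p1 = (float *) _mm_malloc(n * sizeof(float), 16);
    float *p2 = (float *) _mm_malloc(n * sizeof(float), 16);
    float *p3 = (float *) _mm_malloc(n * sizeof(float), 16);
    float sum = 0.0f;
    size_t k;
    int i;

    assert(loopcount >= 1);
    if (p1 && p2 && p3) {
        /* Arbitrary small fill values chosen for the sketch (not rand()). */
        for (k = 0; k < n; k++) {
            p1[k] = (float) (k % 7);
            p2[k] = 2.0f;
        }
        for (i = 0; i < loopcount; i++)
            mul_arrays_sse(p3, p1, p2, n);
        for (k = 0; k < n; k++)
            sum += p3[k];
    }
    _mm_free(p1);
    _mm_free(p2);
    _mm_free(p3);
    return sum;
}
```

Calling run_timing_loop(1000, 10) from main mirrors the arraysize and loopcount used in the listing. Timing that call against a plain float loop over the same arrays would be one way to check whether the store itself is the expensive part.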
For some reason, writing to a location through a pointer of type __m128 * seems to slow things down, but reading through one is fine (e.g. the line "m1=*sptr1;"). If I instead write the multiplied data to a static array (although I truly need a dynamic array), that is, if I replace the line "*sptr3=m3;" with

    x.m[j*i]=m3;

where, say,

    union {
        __m128 m[1000*10];
        float f[1000*10][4];
    } x;

then the program runs as fast as expected.

So what might I be doing wrong in my code that keeps me from taking advantage of the SSE capabilities of the Pentium 4?

--
View this message in context: http://www.nabble.com/HELP-With-Slow-SSE-Code-t1738578.html#a4724748
Sent from the gcc - Help forum at Nabble.com.