Hello,

I tried your options, but they seem to make no difference. In another
email it was suggested that I use _mm_sqrt_sd, because I only need one
sqrt calculation. That improved the time, and the SSE version now almost
matches the serial version (it runs up to 1 second slower for the 10,000
data example, hehe).

But of course, I would expect the vector version to run faster ... I am
still unsure how to achieve that.

Thanks

On Mon, Apr 7, 2008 at 10:23 AM, jlh <jlh@xxxxxx> wrote:
> Dario Bahena Tapia wrote:
> >
> > inline static double dist(int i,int j)
> > {
> >   double xd = C[i][X] - C[j][X];
> >   double yd = C[i][Y] - C[j][Y];
> >   return rint(sqrt(xd*xd + yd*yd));
> > }
> > [...]
> >
> > And in order to activate the SSE2 features, I am using the following
> > flags for gcc (my computer is a laptop):
> >
> > CFLAGS = -O -Wall -march=pentium-m -msse2
> >
>
> These options do not make dist() use any SSE for me. Have you
> tried compiling with this?
>
> CFLAGS = -O2 -Wall -march=pentium-m -mfpmath=sse
>
> I think -msse2 is redundant if you say -march=pentium-m. I don't
> have an SSE2 machine to try this though.
>
> jlh
>
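
For reference, here is a minimal sketch of what the _mm_sqrt_sd variant of dist() mentioned above might look like. It is not the code from the thread. The contents of C[][] are made-up sample points, and rounding with _mm_cvtsd_si32 instead of rint() is my own substitution: it assumes the default round-to-nearest mode and a result that fits in an int. On a compiler without SSE2 enabled, the sketch falls back to the original scalar code.

```c
#if defined(__SSE2__)
#include <emmintrin.h>   /* SSE2 intrinsics: _mm_set_sd, _mm_sqrt_sd, ... */
#else
#include <math.h>        /* scalar fallback: sqrt, rint */
#endif

enum { X = 0, Y = 1 };

/* Hypothetical sample coordinates; the real C[][] is not shown in the thread. */
static double C[4][2] = {
    { 0.0, 0.0 },
    { 3.0, 4.0 },
    { 6.0, 8.0 },
    { 1.0, 1.0 },
};

/* Rounded Euclidean distance between points i and j. */
static inline double dist(int i, int j)
{
    double xd = C[i][X] - C[j][X];
    double yd = C[i][Y] - C[j][Y];
    double s  = xd * xd + yd * yd;
#if defined(__SSE2__)
    __m128d v = _mm_set_sd(s);   /* low lane = s, high lane = 0 */
    v = _mm_sqrt_sd(v, v);       /* square root of the low lane only */
    /* Assumption: default MXCSR rounding (nearest) and result fits in int. */
    return (double)_mm_cvtsd_si32(v);
#else
    return rint(sqrt(s));        /* original scalar version */
#endif
}
```

With the sample points above, dist(0, 1) gives 5 and dist(0, 3) gives 1. Note that this still computes one square root per call, so it can at best match the serial version. A real vector speedup would probably require restructuring the loop to compute two distances at a time, for example with _mm_sqrt_pd, which takes the square root of both lanes in one instruction.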