Oh you are correct ... that improved a lot. However, it still runs slower
than serial version, about 1 second more for the 10,000 data example.

Thanks.

On Mon, Apr 7, 2008 at 9:08 AM, Dario Saccavino <kathoum@xxxxxxxxx> wrote:
> Hello Dario,
>
> I haven't tried your code yet but I think you could get a good boost
> if you replace the "sqrt_pd" call with "sqrt_sd", since you only need
> the square root of a scalar.
>
>    Dario
>
> > inline static double dist_sse(int i,int j)
> > {
> >     double d;
> >     __m128d xmm0,xmm1;
> >     xmm0 =_mm_load_pd(C[i]);
> >     xmm1 = _mm_load_pd(C[j]);
> >     xmm0 = _mm_sub_pd(xmm0,xmm1);
> >     xmm1 = xmm0;
> >     xmm0 = _mm_mul_pd(xmm0,xmm1);
> >     xmm1 = _mm_shuffle_pd(xmm0, xmm0, _MM_SHUFFLE2(1, 1));
> >     xmm0 = _mm_add_pd(xmm0,xmm1);
> >     xmm0 = _mm_sqrt_pd(xmm0);
> >     _mm_store_sd(&d,xmm0);
> >     return rint(d);
> > }
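
For reference, here is roughly what the quoted function looks like with the
scalar square root swapped in. This is only a sketch under a few assumptions:
the thread never shows how C is declared, so I am guessing it is a global
array of 16-byte-aligned (x, y) pairs, and the three sample points are made
up. I also used _mm_add_sd for the final add, and _mm_cvtsd_si32 in place of
rint() for the rounding (same result under the default rounding mode, but
only valid while the distance fits in an int); both of those are my own
substitutions, not something suggested in the thread.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */

/* Assumed layout (not shown in the thread): one aligned (x, y) pair per
 * point. The sample coordinates are hypothetical. */
static _Alignas(16) double C[3][2] = {
    {0.0, 0.0},
    {3.0, 4.0},
    {1.0, 1.0},
};

/* Rounded Euclidean distance between points i and j. */
static inline double dist_sse(int i, int j)
{
    __m128d a    = _mm_load_pd(C[i]);        /* (xi, yi), needs 16-byte alignment */
    __m128d b    = _mm_load_pd(C[j]);        /* (xj, yj) */
    __m128d diff = _mm_sub_pd(a, b);         /* (dx, dy) */
    __m128d sq   = _mm_mul_pd(diff, diff);   /* (dx*dx, dy*dy) */
    __m128d hi   = _mm_unpackhi_pd(sq, sq);  /* (dy*dy, dy*dy) */
    __m128d sum  = _mm_add_sd(sq, hi);       /* low lane: dx*dx + dy*dy */

    /* Scalar square root: only the low lane is computed. */
    sum = _mm_sqrt_sd(sum, sum);

    /* Round to nearest using the current SSE rounding mode. */
    return (double)_mm_cvtsd_si32(sum);
}
```

With the sample points, dist_sse(0, 1) is the 3-4-5 triangle. Whether this
beats the serial version probably depends more on how the compiler handles
the loads and the call overhead than on the sqrt itself, so it is worth
timing both.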