Re: Why worse performace in euclidean distance with SSE2?

"Dario Bahena Tapia" <dario.mx@xxxxxxxxx> · Tue, 8 Apr 2008 09:55:49 -0500

Hello,
Yeah, others have suggested as well changing the way i process them inorder to allow for that. Working there ;-|
Will consider the other suggestions as well !!!
Thanks.

On Tue, Apr 8, 2008 at 2:07 AM, Zuxy Meng <zuxy.meng@xxxxxxxxx> wrote:> Hi,>>  "Dario Bahena Tapia" <dario.mx@xxxxxxxxx>> 写入消息新闻:3d104d6f0804070617u47213cc8nbc697dab9dc262b5@xxxxxxxxxxxxxxxxx>>>> > Hello,> >> > I have just begun to play with SSE2 and gcc intrinsics. Indeed, maybe> > this question is not exactly about gcc  ... but I think gcc lists are> > a very good place to find help from  hardcore assembler hackers ;-1> >> > I have a program which makes heavy usage of euclidean distance> > function. The serial version is:> >> > inline static double dist(int i,int j)> > {> >  double xd = C[i][X] - C[j][X];> >  double yd = C[i][Y] - C[j][Y];> >  return rint(sqrt(xd*xd + yd*yd));> > }> >> > As you can see each C[i] is an array of two double which represents a> > 2D vector (indexes 0 and 1 are coordinates X,Y respectively). I tried> > to vectorize the function using SSE2 and gcc intrinsics, here is the> > result:> >> > inline static double dist_sse(int i,int j)> > {> >  double d;> >  __m128d xmm0,xmm1;> >  xmm0 =_mm_load_pd(C[i]);> >  xmm1 = _mm_load_pd(C[j]);> >  xmm0 = _mm_sub_pd(xmm0,xmm1);> >  xmm1 = xmm0;> >  xmm0 = _mm_mul_pd(xmm0,xmm1);> >  xmm1 = _mm_shuffle_pd(xmm0, xmm0, _MM_SHUFFLE2(1, 1));> >  xmm0 = _mm_add_pd(xmm0,xmm1);> >  xmm0 = _mm_sqrt_pd(xmm0);> >  _mm_store_sd(&d,xmm0);> >  return rint(d);> > }> >> > Of course each C[i] was aligned as SSE2 expects:> >> > for(i=0; i<D; i++)> > C[i] = (double *) _mm_malloc(2 * sizeof(double), 16);> >> > And in order to activate the SSE2 features, I am using the following> > flags for gcc (my computer is a laptop):> >> > CFLAGS = -O -Wall -march=pentium-m -msse2> >> > The vectorized version of the function seems to be correct, given it> > provides same results as serial counterpart. However, the performace> > is poor; execution time of program increases in approximately 50% (for> > example, in calculating the distance of every pair of points from a> > set of 10,000, the serial version takes around 8 seconds while> > vectorized flavor takes 12).> >> > I was expecting a better time given that:> >> > 1. The difference of X and Y is done in parallel> > 2. The product of each difference coordinate with itself is also done> > in parallel> > 3. The sqrt function used is hardware implemented (although serial> > sqrt implementation could also take advantage of hardware)> >> > I suppose the problem here is my lack of experience programming in> > assembler in general, and in particular with SSE2. Therefore, I am> > looking for advice.> >>>  1. First of all, you didn't extract the parallelism in your algorithm. SSE2> won't help you if all you want is to pick up two points at random indices> and calculate the distance. However it will help you a lot when you> calculate the distances between a given point and 1 million others whose> indices are sequential.>>  2. Unroll the loop to hide the latency of square root as much as possible.>>  3. Since the final result is an integer, you may consider using "float"> instead of "double". That'll give you a performance boost even without SSE2.> And rsqrtps comes in handy too, if its precision is acceptable.>>  -->  Zuxy>>