Re: enabling SSE for 3-vector inner product

Qianqian Fang <fangqq@xxxxxxxxx> · Tue, 04 May 2010 10:21:36 -0400

On 05/02/2010 01:24 PM, Brian Budge wrote:
Hi Qianquan-

The way that SSE and many other SIMD instruction sets work best is if
you lay out your computation in a struct-of-arrays format rather than
an array-of-structs format.

Imagine you have 12 dot products to perform.  Your current layout is
array-of-structs.  You have 12 float3s.  Instead, you could lay this
out as 6 __m128s for x, 6 __m128s for y, and 6 __m128s for z.

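/* 12 vector pairs, four per __m128, split by component */
__m128 x0[3], y0[3], z0[3], x1[3], y1[3], z1[3];

__m128 r[3];  /* each r[i] holds four dot products */

/* arithmetic operators on __m128 rely on GCC's vector extensions */
for (size_t i = 0; i < 3; ++i) {
    r[i] = x0[i] * x1[i] + y0[i] * y1[i] + z0[i] * z1[i];
}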

This is roughly an optimal way of calculating these dot products with
SSE.  The issue is that in order to really get the full benefit of
this, your algorithm should try to keep the data strided in this way
all the time, so that you don't have to waste cycles reorganizing your
data into this format.  Often such changes will require modification
of your whole algorithm; this is not always a bad thing though.  Once
you change your data layout to be like this persistently, you widen
the optimization opportunities of the rest of the program.
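
A minimal sketch of the struct-of-arrays idea described above, written
with explicit SSE intrinsics instead of operators on __m128. The
function name dot3_soa, the separate per-component float arrays, and
the requirement that n be a multiple of 4 are assumptions made for
illustration; they are not code from this thread.

```c
#include <assert.h>
#include <stddef.h>
#include <xmmintrin.h>  /* SSE: __m128, _mm_loadu_ps, _mm_mul_ps, _mm_add_ps */

/* Hypothetical helper: out[i] = a[i] . b[i] for n 3-vectors whose x, y and z
 * components live in separate arrays (struct-of-arrays).
 * n must be a multiple of 4; unaligned loads are used for simplicity. */
static void dot3_soa(const float *ax, const float *ay, const float *az,
                     const float *bx, const float *by, const float *bz,
                     float *out, size_t n)
{
    assert(n % 4 == 0);
    for (size_t i = 0; i < n; i += 4) {
        /* Each product register holds one component of four dot products. */
        __m128 x = _mm_mul_ps(_mm_loadu_ps(ax + i), _mm_loadu_ps(bx + i));
        __m128 y = _mm_mul_ps(_mm_loadu_ps(ay + i), _mm_loadu_ps(by + i));
        __m128 z = _mm_mul_ps(_mm_loadu_ps(az + i), _mm_loadu_ps(bz + i));
        /* Vertical adds only: no horizontal sum or shuffle is needed. */
        _mm_storeu_ps(out + i, _mm_add_ps(_mm_add_ps(x, y), z));
    }
}
```

With n = 12 this covers the 12 dot products in three loop iterations,
and every lane of every register does useful work.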

You say that this section of dot products takes 30% of the time.
Imagine you speed it up 4x.  Now something else is likely to be your
bottleneck.  If you change your algorithm to be SIMD-friendly all
over, there are more opportunities for optimization in other parts of
the code.
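
The arithmetic behind this point can be sketched with Amdahl's law. The
function name overall_speedup is hypothetical; the inputs 0.3 and 4 are
the 30% share and the 4x speedup mentioned above.

```c
#include <assert.h>

/* Overall speedup when a fraction f of the run time is made s times faster
 * (Amdahl's law): 1 / ((1 - f) + f / s). */
static double overall_speedup(double f, double s)
{
    assert(f >= 0.0 && f <= 1.0 && s > 0.0);
    return 1.0 / ((1.0 - f) + f / s);
}
```

overall_speedup(0.3, 4.0) is about 1.29, and even an infinitely fast
dot product could not push the whole program past 1/0.7, or roughly
1.43x. The remaining 70% of the run time then dominates.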

Good luck!

Thank you very much, Brian. Your comments are very helpful.
I will try to rearrange the data structure based on your
suggestions. Meanwhile, I will also port my code to OpenCL,
and hopefully it will be significantly more efficient
on the GPU.

Qianqian

   Brian

On Sat, May 1, 2010 at 8:37 PM, Qianqian Fang <fangqq@xxxxxxxxx> wrote:

On 05/01/2010 09:58 PM, Brian Budge wrote:

Unless you modify your algorithm to perform many dot products
simultaneously, you're probably at your limit.  Depending on what
you're trying to do, you might get close to a 4x speedup on the dot
products (conversely, you might not be able to do any better).

Hi Brian,

May I ask what you meant by "many dot products simultaneously"?

From my reading, there are only a limited number of xmm registers.
To do a dot product, I need to do at least

__m128 a = _mm_loadu_ps(p1);
__m128 b = _mm_loadu_ps(p2);
__m128 d = _mm_dp_ps(a, b, 0x7f);
_mm_store_ss(&result1, d);

By "many dot products", did you mean repeating the above
pseudo-code for different vectors, or running as many
_mm_dp_ps as possible on a fixed set of vectors?

In the inner loop of my code, I need to perform 12 inner
products in a row using 14 vectors. Will the overhead of
loading/storing kill the speed-up?
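
One possible reading of "many dot products simultaneously" is sketched
below: load four array-of-structs vectors per operand, transpose them
in registers so each register holds one component, then compute four
dot products with vertical multiplies and adds. The vec4f type (a
3-vector padded with one unused float) and the name dot3_x4 are
assumptions for illustration. The two transposes are exactly the kind
of reorganization cost that a persistent struct-of-arrays layout would
avoid.

```c
#include <xmmintrin.h>  /* SSE: _MM_TRANSPOSE4_PS, _mm_mul_ps, _mm_add_ps */

/* Hypothetical padded vector: x, y, z plus one unused float, so that one
 * vector fills exactly one 16-byte SSE register. */
typedef struct { float x, y, z, w; } vec4f;

/* Computes out[i] = a[i] . b[i] over x, y, z only, for i = 0..3. */
static void dot3_x4(const vec4f a[4], const vec4f b[4], float out[4])
{
    __m128 a0 = _mm_loadu_ps(&a[0].x), a1 = _mm_loadu_ps(&a[1].x);
    __m128 a2 = _mm_loadu_ps(&a[2].x), a3 = _mm_loadu_ps(&a[3].x);
    __m128 b0 = _mm_loadu_ps(&b[0].x), b1 = _mm_loadu_ps(&b[1].x);
    __m128 b2 = _mm_loadu_ps(&b[2].x), b3 = _mm_loadu_ps(&b[3].x);

    /* After each transpose, register 0 holds the four x components,
     * register 1 the four y, register 2 the four z, register 3 the padding. */
    _MM_TRANSPOSE4_PS(a0, a1, a2, a3);
    _MM_TRANSPOSE4_PS(b0, b1, b2, b3);

    /* The padding registers a3 and b3 are simply not used. */
    __m128 r = _mm_add_ps(_mm_add_ps(_mm_mul_ps(a0, b0), _mm_mul_ps(a1, b1)),
                          _mm_mul_ps(a2, b2));
    _mm_storeu_ps(out, r);
}
```

Twelve dot products would take three such calls. Whether this beats
twelve separate _mm_dp_ps calls depends on how much the shuffles inside
the transposes cost on the target CPU, so it is worth measuring.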

thanks

Qianqian

   Brian

On Fri, Apr 30, 2010 at 8:35 PM, Qianqian Fang <fangqq@xxxxxxxxx> wrote:

hi Marc

On 04/30/2010 06:31 AM, Marc Glisse wrote:

On Thu, 29 Apr 2010, Qianqian Fang wrote:

Shouldn't there be some magic here for alignment purposes?

thank you for pointing this out. I changed the definition to

typedef struct CPU_float4{
    float x,y,z,w;
} float4 __attribute__ ((aligned(16)));

but the run-time using SSE3 remains the same.
Is my above change correct?
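
A small sketch of what the attribute does and does not do, assuming
GCC. The aligned(16) attribute only guarantees where float4 objects are
placed; code that still calls _mm_loadu_ps keeps issuing unaligned
loads. To take advantage of the alignment, the aligned forms
_mm_load_ps and _mm_store_ps have to be used. The helper float4_add is
hypothetical and exists only to show the aligned calls.

```c
#include <assert.h>
#include <stdint.h>
#include <xmmintrin.h>  /* SSE: _mm_load_ps, _mm_store_ps, _mm_add_ps */

typedef struct CPU_float4 {
    float x, y, z, w;
} float4 __attribute__((aligned(16)));

/* Hypothetical helper: component-wise sum using aligned loads and stores. */
static float4 float4_add(const float4 *a, const float4 *b)
{
    /* _mm_load_ps requires 16-byte aligned addresses. */
    assert((uintptr_t)a % 16 == 0 && (uintptr_t)b % 16 == 0);
    float4 r;  /* the typedef's attribute aligns this local to 16 bytes */
    _mm_store_ps(&r.x, _mm_add_ps(_mm_load_ps(&a->x), _mm_load_ps(&b->x)));
    return r;
}
```

Whether the aligned forms are measurably faster depends on the CPU, so
an unchanged run time after adding only the attribute is plausible.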

Now I am trying to use the SSE4.x DPPS instruction, but gcc gave me
an error. I don't know if I used it in the wrong format.

Did you try using the intrinsic _mm_dp_ps?

Yes, I removed the asm and used _mm_dp_ps; it works now.
The code now looks like this:

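#include <smmintrin.h>  /* SSE4.1: _mm_dp_ps; compile with -msse4.1 */

/* Dot product of two 3-vectors. Each load reads 4 floats (16 bytes),
   so a and b must point to storage padded to at least 4 floats. */
static inline float vec_dot(float3 *a,float3 *b){
        float dot;
        __m128 na,nb,res;
        na=_mm_loadu_ps((float*)a);
        nb=_mm_loadu_ps((float*)b);
        res=_mm_dp_ps(na,nb,0x7f);  /* multiply x,y,z; sum; write all lanes */
        _mm_store_ss(&dot,res);     /* keep lane 0 only */
        return dot;
}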

Sadly, using SSE4 gave me only a few percent (2~5%)
speed-up over the original C code. My profiling results
indicated that the inner product took about 30% of my total
run time. Does this speedup make sense?
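
For reference, the 0x7f immediate passed to _mm_dp_ps above can be read
with the scalar model below. This is a plain C illustration of the
instruction's mask semantics, not an intrinsic; the name dpps_model is
hypothetical. The high four bits of the mask select which lanes are
multiplied and summed, and the low four bits select which output lanes
receive the sum (the others are set to zero).

```c
/* Scalar model of the DPPS mask, for illustration only. */
static void dpps_model(const float a[4], const float b[4],
                       unsigned mask, float out[4])
{
    float sum = 0.0f;
    /* Bits 4..7: lane i takes part in the sum if bit (4 + i) is set. */
    for (int i = 0; i < 4; ++i)
        if (mask & (0x10u << i))
            sum += a[i] * b[i];
    /* Bits 0..3: output lane i gets the sum if bit i is set, else 0. */
    for (int i = 0; i < 4; ++i)
        out[i] = (mask & (1u << i)) ? sum : 0.0f;
}
```

So 0x7f multiplies x, y and z, ignores the fourth lane, and writes the
sum to all four output lanes. Since vec_dot stores only lane 0, 0x71
would give the same result there.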

               "dpps %%xmm0, %%xmm1, 0xF1 \n\t"

Maybe the order of the arguments is reversed in asm and it likes a $
before a constant (and it prefers fewer parentheses on the next line).

with gcc -S, I can see that the assembly is in fact
dpps 127, xmm1, xmm0, so perhaps it was reversed
in my previous version.

In any case, you shouldn't get a factor 2 compared to the SSE3 version,
so that won't be enough for you.

Well, as I mentioned earlier, using SSE3 made my code 2.5x slower, not
faster. SSE4 is now 2~5% faster, but the gain is still not as
significant as I expected. I guess that's probably the best I can do
with it. Right?

thanks

Qianqian