Hi Qianqian-

The way that SSE and many other SIMD instruction sets work best is if you
lay your data out in a struct-of-arrays format rather than an
array-of-structs format. Imagine you have 12 dot products to perform. Your
current layout is array-of-structs: each vector is a separate float3.
Instead, you could lay this out as 6 __m128s for x, 6 __m128s for y, and
6 __m128s for z (3 per operand set, since each __m128 packs 4 floats):

  __m128 x0[3], y0[3], z0[3], x1[3], y1[3], z1[3];
  __m128 r[3];
  for(size_t i = 0; i < 3; ++i) {
      r[i] = x0[i] * x1[i] + y0[i] * y1[i] + z0[i] * z1[i];
  }

Each iteration computes 4 dot products at once, and this is roughly an
optimal way of calculating these dot products with SSE.
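Note that the overloaded * and + on __m128 above are a compiler extension
(GCC/Clang vector arithmetic). Spelled with portable intrinsics, a
self-contained sketch of the same idea might look like this -- the name
dot12 and its argument layout are just made up for illustration:

  #include <stddef.h>     /* size_t */
  #include <xmmintrin.h>  /* SSE: _mm_mul_ps, _mm_add_ps, _mm_storeu_ps */

  /* Illustrative sketch: 12 dot products in struct-of-arrays form.
     x0[i] packs the x components of four vectors from the first
     operand set, x1[i] the x components of the corresponding four
     vectors from the second set, and so on for y and z. */
  static void dot12(const __m128 x0[3], const __m128 y0[3], const __m128 z0[3],
                    const __m128 x1[3], const __m128 y1[3], const __m128 z1[3],
                    float out[12])
  {
      for(size_t i = 0; i < 3; ++i) {
          __m128 r = _mm_add_ps(_mm_add_ps(_mm_mul_ps(x0[i], x1[i]),
                                           _mm_mul_ps(y0[i], y1[i])),
                                _mm_mul_ps(z0[i], z1[i]));
          _mm_storeu_ps(&out[4*i], r);  /* write 4 results per iteration */
      }
  }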
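And here is a hypothetical caller, just to make the cost visible: if your
inputs start out as arrays of float3 (assuming a plain three-float struct;
yours may be padded to 16 bytes), you first have to gather them into the
strided registers. The _mm_set_ps swizzle below is exactly the kind of
reorganization overhead I mean:

  /* Hypothetical caller, for illustration only. */
  typedef struct { float x, y, z; } float3;

  static void dot12_from_aos(const float3 a[12], const float3 b[12],
                             float out[12])
  {
      __m128 x0[3], y0[3], z0[3], x1[3], y1[3], z1[3];
      for(size_t i = 0; i < 3; ++i) {
          /* _mm_set_ps takes its arguments high-lane-first */
          x0[i] = _mm_set_ps(a[4*i+3].x, a[4*i+2].x, a[4*i+1].x, a[4*i].x);
          y0[i] = _mm_set_ps(a[4*i+3].y, a[4*i+2].y, a[4*i+1].y, a[4*i].y);
          z0[i] = _mm_set_ps(a[4*i+3].z, a[4*i+2].z, a[4*i+1].z, a[4*i].z);
          x1[i] = _mm_set_ps(b[4*i+3].x, b[4*i+2].x, b[4*i+1].x, b[4*i].x);
          y1[i] = _mm_set_ps(b[4*i+3].y, b[4*i+2].y, b[4*i+1].y, b[4*i].y);
          z1[i] = _mm_set_ps(b[4*i+3].z, b[4*i+2].z, b[4*i+1].z, b[4*i].z);
      }
      dot12(x0, y0, z0, x1, y1, z1, out);
  }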
The issue is that in order to really get the full benefit of this, your
algorithm should try to keep the data strided in this way all the time, so
that you don't have to waste cycles reorganizing your data into this format
the way the gather loop above does. Often such a change will require
modifying your whole algorithm; that is not always a bad thing, though.
Once you change your data layout to be like this persistently, you widen
the optimization opportunities for the rest of the program.

You say that this section of dot products takes 30% of the time. Imagine
you speed it up 4x: by Amdahl's law the overall win is only about
1/(0.7 + 0.3/4), roughly 1.3x, and now something else is likely to be your
bottleneck. If you change your algorithm to be SIMD-friendly all over,
there are more opportunities for optimization in other parts of the code.

Good luck!
  Brian

On Sat, May 1, 2010 at 8:37 PM, Qianqian Fang <fangqq@xxxxxxxxx> wrote:
> On 05/01/2010 09:58 PM, Brian Budge wrote:
>>
>> Unless you modify your algorithm to perform many dot products
>> simultaneously, you're probably at your limit. Depending on what
>> you're trying to do, you might get close to a 4x speedup on the dot
>> products (conversely, you might not be able to do any better)
>>
>
> hi Brian
>
> may I ask what you meant by "many dot products simultaneously"?
>
> from my reading, there is only a limited number of xmm registers,
> and in order to do one dot product, I need to do at least
>
> __m128 v1, v2, res;
> v1  = _mm_loadu_ps(p1);
> v2  = _mm_loadu_ps(p2);
> res = _mm_dp_ps(v1, v2, 0x7f);
> _mm_store_ss(&result1, res);
>
> if I do many dot products, did you mean repeating the above
> pseudo-code for different vectors? or trying to run
> as many _mm_dp_ps as possible for a fixed set of vectors?
>
> In the inner loop of my code, I need to perform 12 inner
> products in a row using 14 vectors. Will the overhead of
> loading/storing kill the speed-up?
>
> thanks
>
> Qianqian
>
>> Brian
>>
>> On Fri, Apr 30, 2010 at 8:35 PM, Qianqian Fang<fangqq@xxxxxxxxx> wrote:
>>>
>>> hi Marc
>>>
>>> On 04/30/2010 06:31 AM, Marc Glisse wrote:
>>>>
>>>> On Thu, 29 Apr 2010, Qianqian Fang wrote:
>>>>
>>>> Shouldn't there be some magic here for alignment purposes?
>>>>
>>>
>>> thank you for pointing this out. I changed the definition to
>>>
>>> typedef struct CPU_float4{
>>>     float x,y,z,w;
>>> } float4 __attribute__ ((aligned(16)));
>>>
>>> but the run-time using SSE3 remains the same.
>>> Is my above change correct?
>>>
>>>>
>>>>> now I am trying to use the SSE4.x DPPS instruction, but gcc gave
>>>>> me an error. I don't know if I used it with the wrong format.
>>>>
>>>> Did you try using the intrinsic _mm_dp_ps?
>>>
>>> yes, I removed the asm and used _mm_dp_ps; it works now.
>>> the code now looks like this:
>>>
>>> inline float vec_dot(float3 *a, float3 *b){
>>>     float dot;
>>>     __m128 na, nb, res;
>>>     na  = _mm_loadu_ps((float*)a);  /* loads 4 floats, so *a needs a 4th (pad) element */
>>>     nb  = _mm_loadu_ps((float*)b);
>>>     res = _mm_dp_ps(na, nb, 0x7f); /* high nibble 0x7: multiply x,y,z; low nibble 0xf: sum into all lanes */
>>>     _mm_store_ss(&dot, res);
>>>     return dot;
>>> }
>>>
>>> sadly, using SSE4 only gave me a few percent (2~5%)
>>> speed-up over the original C code. My profiling result
>>> indicated the inner product took about 30% of my total
>>> run time. Does this speedup make sense?
>>>
>>>>> "dpps %%xmm0, %%xmm1, 0xF1 \n\t"
>>>>
>>>> Maybe the order of the arguments is reversed in asm and it likes a $
>>>> before a constant (and it prefers fewer parentheses on the next line).
>>>
>>> with gcc -S, I can see that the assembly is in fact
>>> dpps 127, xmm1, xmm0, so perhaps it was reversed
>>> in my previous version.
>>>
>>>> In any case, you shouldn't get a factor of 2 over the SSE3 version,
>>>> so that won't be enough for you.
>>>
>>> well, as I mentioned earlier, using SSE3 made my code 2.5x slower,
>>> not faster. SSE4 is now 2~5% faster, but still not as significant
>>> as I expected. I guess that's probably the best I can do with it.
>>> Right?
>>>
>>> thanks
>>>
>>> Qianqian