Re: enabling SSE for 3-vector inner product

Brian Budge <brian.budge@xxxxxxxxx> · Sat, 1 May 2010 18:58:37 -0700

Unless you modify your algorithm to perform many dot products
simultaneously, you're probably at your limit.  If you can batch them,
you might get close to a 4x speedup on the dot products, depending on
what you're trying to do (conversely, you might not be able to do any
better).
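
To make "many dot products simultaneously" concrete, here is a rough
sketch, not something taken from your code.  It assumes you can store
the vectors as separate x/y/z arrays (structure-of-arrays) rather than
as an array of float3/float4 structs.  The name vec_dot_batch and its
parameters are made up for illustration.

```c
#include <stddef.h>
#include <xmmintrin.h>  /* SSE: _mm_loadu_ps, _mm_mul_ps, _mm_add_ps, _mm_storeu_ps */

/* Hypothetical helper: out[i] = a[i] . b[i] for n 3-vectors whose
   components live in separate arrays (ax/ay/az and bx/by/bz). */
static void vec_dot_batch(const float *ax, const float *ay, const float *az,
                          const float *bx, const float *by, const float *bz,
                          float *out, size_t n)
{
    size_t i = 0;

    /* Main loop: each iteration produces four dot products. */
    for (; i + 4 <= n; i += 4) {
        __m128 x = _mm_mul_ps(_mm_loadu_ps(ax + i), _mm_loadu_ps(bx + i));
        __m128 y = _mm_mul_ps(_mm_loadu_ps(ay + i), _mm_loadu_ps(by + i));
        __m128 z = _mm_mul_ps(_mm_loadu_ps(az + i), _mm_loadu_ps(bz + i));
        _mm_storeu_ps(out + i, _mm_add_ps(_mm_add_ps(x, y), z));
    }

    /* Scalar tail for the remaining 0..3 vectors. */
    for (; i < n; i++)
        out[i] = ax[i] * bx[i] + ay[i] * by[i] + az[i] * bz[i];
}
```

Every lane does useful work here and no horizontal add is needed, which
is where the "close to 4x" would come from.  Whether it pays off
depends on whether the rest of your algorithm can supply and consume
the dot products four at a time.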

  Brian

On Fri, Apr 30, 2010 at 8:35 PM, Qianqian Fang <fangqq@xxxxxxxxx> wrote:
> hi Marc
>
> On 04/30/2010 06:31 AM, Marc Glisse wrote:
>>
>> On Thu, 29 Apr 2010, Qianqian Fang wrote:
>>
>> Shouldn't there be some magic here for alignment purposes?
>
> thank you for pointing this out. I changed the definition to
>
> typedef struct CPU_float4{
>    float x,y,z,w;
> } float4 __attribute__ ((aligned(16)));
>
> but the run time with SSE3 remains the same.
> Is the change above correct?
>
>>
>>> now I am trying to use SSE4.x DPPS, but gcc gave me
>>> error. I don't know if I used it with a wrong format.
>>
>> Did you try using the intrinsic _mm_dp_ps?
>
> yes, I removed the asm and used _mm_dp_ps, and it works now.
> the code now looks like this:
>
> inline float vec_dot(float3 *a,float3 *b){
>        float dot;
>        __m128 na,nb,res;
>        na=_mm_loadu_ps((float*)a);
>        nb=_mm_loadu_ps((float*)b);
>        res=_mm_dp_ps(na,nb,0x7f);
>        _mm_store_ss(&dot,res);
>        return dot;
> }
>
> sadly, using SSE4 only gave me a few percent (2~5%)
> speed-up over the original C code. My profiling result
> indicated the inner product took about 30% of my total
> run time. Does this speedup make sense?
>
>>>               "dpps %%xmm0, %%xmm1, 0xF1 \n\t"
>>
>> Maybe the order of the arguments is reversed in asm and it likes a $
>> before a constant (and it prefers fewer parentheses on the next line).
>
>
> with gcc -S, I can see that the generated assembly is in fact
> dpps $127, %xmm1, %xmm0, so perhaps the operands were reversed
> in my previous version.
>
>
>> In any case, you shouldn't get a factor 2 compared to the SSE3 version, so
>> that won't be enough for you.
>
> well, as I mentioned earlier, using SSE3 made my code 2.5x slower, not
> faster.
> SSE4 is now 2~5% faster, but still not as significant as I thought.
> I guess that's probably the best I can do with it. Right?
>
> thanks
>
> Qianqian
>