Unless you modify your algorithm to perform many dot products
simultaneously, you're probably at your limit. Depending on what
you're trying to do, you might get close to a 4x speedup on the dot
products (conversely, you might not be able to do any better).

Brian

On Fri, Apr 30, 2010 at 8:35 PM, Qianqian Fang <fangqq@xxxxxxxxx> wrote:
> hi Marc
>
> On 04/30/2010 06:31 AM, Marc Glisse wrote:
>>
>> On Thu, 29 Apr 2010, Qianqian Fang wrote:
>>
>> Shouldn't there be some magic here for alignment purposes?
>
> thank you for pointing this out. I changed the definition to
>
> typedef struct CPU_float4{
>     float x,y,z,w;
> } float4 __attribute__ ((aligned(16)));
>
> but the run-time using SSE3 remains the same.
> Is my above change correct?
>
>>
>>> now I am trying to use SSE4.x DPPS, but gcc gave me
>>> error. I don't know if I used it with a wrong format.
>>
>> Did you try using the intrinsic _mm_dp_ps?
>
> yes, I removed the asm and use mm_dp_ps, it works now.
> the code now looks like this:
>
> inline float vec_dot(float3 *a,float3 *b){
>     float dot;
>     __m128 na,nb,res;
>     na=_mm_loadu_ps((float*)a);
>     nb=_mm_loadu_ps((float*)b);
>     res=_mm_dp_ps(na,nb,0x7f);
>     _mm_store_ss(&dot,res);
>     return dot;
> }
>
> sadly, using SSE4 only gave me a few percent (2~5%)
> speed-up over the original C code. My profiling result
> indicated the inner product took about 30% of my total
> run time. Does this speedup make sense?
>
>>> "dpps %%xmm0, %%xmm1, 0xF1 \n\t"
>>
>> Maybe the order of the arguments is reversed in asm and it likes a $
>> before a constant (and it prefers fewer parentheses on the next line).
>
>
> with gcc -S, I can see that the assembly is in fact
> dpps 127, xmm1, xmm0, so perhaps it was reversed
> in my previous version.
>
>
>> In any case, you shouldn't get a factor 2 compared to the SSE3 version, so
>> that won't be enough for you.
>
> well, as I mentioned earlier, using SSE3 made my code 2.5x slower, not
> faster.
> SSE4 is now 2~5% faster, but still not as significant as I thought.
> I guess that's probably the best I can do with it. Right?
>
> thanks
>
> Qianqian
>
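
To make the "many dot products simultaneously" suggestion at the top of this reply concrete, here is a rough sketch of one way it could look. It assumes the data can be rearranged into a structure-of-arrays layout (separate x, y and z arrays), which is not how the code in this thread stores its vectors. The names vec3_soa and vec_dot_many are made up for illustration. Each SSE register then holds the same component of four different vectors, so one pass through the loop body yields four dot products with no horizontal add. Whether this pays off depends on whether the surrounding algorithm can actually supply four independent dot products at a time.

```c
#include <stddef.h>

#if defined(__SSE__) || defined(__x86_64__) || defined(_M_X64)
#include <xmmintrin.h>   /* SSE1 intrinsics: _mm_loadu_ps, _mm_mul_ps, ... */
#define HAVE_SSE 1
#endif

/* Hypothetical structure-of-arrays layout (an assumption, not the layout
   used in the thread): the x, y and z components of n vectors are kept
   in three separate float arrays. */
typedef struct {
    const float *x, *y, *z;
} vec3_soa;

/* Computes out[i] = a[i] . b[i] for i in [0, n). */
static void vec_dot_many(vec3_soa a, vec3_soa b, float *out, size_t n)
{
    size_t i = 0;
#ifdef HAVE_SSE
    /* Four dot products per iteration: each register holds one component
       of four consecutive vectors. Unaligned loads are used so the
       arrays need no particular alignment. */
    for (; i + 4 <= n; i += 4) {
        __m128 acc = _mm_mul_ps(_mm_loadu_ps(a.x + i), _mm_loadu_ps(b.x + i));
        acc = _mm_add_ps(acc,
              _mm_mul_ps(_mm_loadu_ps(a.y + i), _mm_loadu_ps(b.y + i)));
        acc = _mm_add_ps(acc,
              _mm_mul_ps(_mm_loadu_ps(a.z + i), _mm_loadu_ps(b.z + i)));
        _mm_storeu_ps(out + i, acc);
    }
#endif
    /* Scalar loop: handles the leftover elements, or everything when SSE
       is not available at compile time. */
    for (; i < n; i++)
        out[i] = a.x[i] * b.x[i] + a.y[i] * b.y[i] + a.z[i] * b.z[i];
}
```

A caller would fill the three component arrays for each operand once and then call vec_dot_many on the whole batch, rather than calling a per-vector vec_dot in the inner loop.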