On 05/02/2010 01:24 PM, Brian Budge wrote:
Hi Qianqian,

The way that SSE and many other SIMD instruction sets work best is if you lay out your computation in a struct-of-arrays format rather than an array-of-structs format. Imagine you have 12 dot products to perform. Your current layout is array-of-structs: you have 12 float3s. Instead, you could lay this out as 6 __m128s for x, 6 __m128s for y, and 6 __m128s for z.

    /* Each __m128 holds one component of 4 different vectors. */
    __m128 x0[3], y0[3], z0[3], x1[3], y1[3], z1[3];
    __m128 r[3];
    /* The * and + operators on __m128 rely on GCC's vector extensions. */
    for(size_t i = 0; i < 3; ++i) {
        r[i] = x0[i] * x1[i] + y0[i] * y1[i] + z0[i] * z1[i];
    }

This is roughly an optimal way of calculating these dot products with SSE. The issue is that in order to really get the full benefit of this, your algorithm should try to keep the data strided in this way all the time, so that you don't have to waste cycles reorganizing your data into this format.

Often such changes will require modification of your whole algorithm; this is not always a bad thing though. Once you change your data layout to be like this persistently, you widen the optimization opportunities of the rest of the program. You say that this section of dot products takes 30% of the time. Imagine you speed it up 4x. Now something else is likely to be your bottleneck. If you change your algorithm to be SIMD-friendly all over, there are more opportunities for optimization in other parts of the code.

Good luck!
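
To make the struct-of-arrays idea above concrete, here is a minimal sketch in plain C using the standard SSE intrinsics instead of operators on __m128. The function name dot_soa, the separate component arrays and the requirement that n be a multiple of 4 are assumptions made for illustration, not something from the thread. The scalar branch is only there so the sketch still builds when SSE is not enabled.

```c
#include <assert.h>
#include <stddef.h>
#if defined(__SSE__)
#include <xmmintrin.h>  /* SSE: _mm_loadu_ps, _mm_mul_ps, _mm_add_ps, _mm_storeu_ps */
#endif

/* Hypothetical helper: r[k] = (x0[k],y0[k],z0[k]) . (x1[k],y1[k],z1[k]).
 * All inputs are struct-of-arrays: one float array per component. */
static void dot_soa(const float *x0, const float *y0, const float *z0,
                    const float *x1, const float *y1, const float *z1,
                    float *r, size_t n)
{
    assert(n % 4 == 0);  /* this sketch handles whole groups of four only */
#if defined(__SSE__)
    for (size_t k = 0; k < n; k += 4) {
        /* Each multiply covers the same component of four different vectors. */
        __m128 xx = _mm_mul_ps(_mm_loadu_ps(x0 + k), _mm_loadu_ps(x1 + k));
        __m128 yy = _mm_mul_ps(_mm_loadu_ps(y0 + k), _mm_loadu_ps(y1 + k));
        __m128 zz = _mm_mul_ps(_mm_loadu_ps(z0 + k), _mm_loadu_ps(z1 + k));
        /* Four finished dot products are stored at once; no horizontal add. */
        _mm_storeu_ps(r + k, _mm_add_ps(_mm_add_ps(xx, yy), zz));
    }
#else
    /* Scalar fallback with the same result. */
    for (size_t k = 0; k < n; ++k)
        r[k] = x0[k] * x1[k] + y0[k] * y1[k] + z0[k] * z1[k];
#endif
}
```

With n = 12 the loop runs three times, which corresponds to the three __m128 groups in the snippet above. Unaligned loads are used here so the sketch makes no assumption about how the arrays were allocated.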
thank you very much, Brian. Your comments are very helpful. I will try to rearrange the data structure based on your suggestions. Meanwhile, I will also port my code to OpenCL, and hopefully this will be significantly more efficient on the GPU.

Qianqian
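
As a rough illustration of the rearrangement mentioned above, the sketch below copies an array of float3 structs into three separate component arrays. The float3 layout (three floats, no padding) and the helper name aos_to_soa are assumptions; the thread does not show the actual definition. This copy is exactly the reorganizing cost Brian warns about, so it would ideally run once up front rather than inside the inner loop.

```c
#include <stddef.h>

/* Assumed layout of the float3 type used in the thread. */
typedef struct { float x, y, z; } float3;

/* Hypothetical helper: split n array-of-structs vectors into
 * struct-of-arrays form, one output array per component. */
static void aos_to_soa(const float3 *v, size_t n,
                       float *x, float *y, float *z)
{
    for (size_t i = 0; i < n; ++i) {
        x[i] = v[i].x;
        y[i] = v[i].y;
        z[i] = v[i].z;
    }
}
```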
Brian

On Sat, May 1, 2010 at 8:37 PM, Qianqian Fang<fangqq@xxxxxxxxx> wrote:

On 05/01/2010 09:58 PM, Brian Budge wrote:

Unless you modify your algorithm to perform many dot products simultaneously, you're probably at your limit. Depending on what you're trying to do, you might get close to a 4x speedup on the dot products (conversely, you might not be able to do any better)

hi Brian

may I ask what you meant by "many dot products simultaneously"? from my reading, there are only a limited number of xmm registers. in order to do a dot product, I need to at least do

    _mm_loadu_ps(p1);
    _mm_loadu_ps(p2);
    _mm_dp_ps(p1,p2);
    _mm_store_ss(result1);

if I do many dot products, did you mean repeating the above pseudo-code for different vectors? or trying to run as many _mm_dp_ps as possible for a fixed set of vectors? In the inner loop of my code, I need to perform 12 inner products in a row using 14 vectors. Will the overhead for loading/storing kill the speed-up?

thanks

Qianqian

Brian

On Fri, Apr 30, 2010 at 8:35 PM, Qianqian Fang<fangqq@xxxxxxxxx> wrote:

hi Marc

On 04/30/2010 06:31 AM, Marc Glisse wrote:

On Thu, 29 Apr 2010, Qianqian Fang wrote:

Shouldn't there be some magic here for alignment purposes?

thank you for pointing this out. I changed the definition to

    typedef struct CPU_float4{
        float x,y,z,w;
    } float4 __attribute__ ((aligned(16)));

but the run-time using SSE3 remains the same. Is my above change correct?

now I am trying to use SSE4.x DPPS, but gcc gave me an error. I don't know if I used it with a wrong format.

Did you try using the intrinsic _mm_dp_ps?

yes, I removed the asm and used _mm_dp_ps, it works now. the code now looks like this:

    inline float vec_dot(float3 *a,float3 *b){
        float dot;
        __m128 na,nb,res;
        na=_mm_loadu_ps((float*)a);
        nb=_mm_loadu_ps((float*)b);
        res=_mm_dp_ps(na,nb,0x7f);
        _mm_store_ss(&dot,res);
        return dot;
    }

sadly, using SSE4 only gave me a few percent (2~5%) speed-up over the original C code.
My profiling result indicated the inner product took about 30% of my total run time. Does this speedup make sense?

    "dpps %%xmm0, %%xmm1, 0xF1 \n\t"

Maybe the order of the arguments is reversed in asm and it likes a $ before a constant (and it prefers fewer parentheses on the next line).

with gcc -S, I can see that the assembly is in fact dpps 127, xmm1, xmm0, so perhaps it was reversed in my previous version.

In any case, you shouldn't get a factor 2 compared to the SSE3 version, so that won't be enough for you.

well, as I mentioned earlier, using SSE3 made my code 2.5x slower, not faster. SSE4 is now 2~5% faster, but still not as significant as I thought. I guess that's probably the best I can do with it. Right?

thanks

Qianqian
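
For reference, here is a minimal sketch of a single-pair dot product along the lines of the vec_dot quoted in this thread, written against a padded 16-byte type. If float3 is three floats with no padding, a 16-byte _mm_loadu_ps on it reads past the end of the struct; loading from a four-float struct avoids that. The type name float4 follows the thread, but the attribute placement (GCC/Clang syntax), the const parameters and the scalar fallback are assumptions for illustration. The intrinsic path is only compiled when SSE4.1 is enabled, for example with gcc -msse4.1.

```c
#if defined(__SSE4_1__)
#include <smmintrin.h>  /* SSE4.1: _mm_dp_ps */
#endif

/* Padded, 16-byte-aligned vector; w is padding and is ignored below. */
typedef struct { float x, y, z, w; } __attribute__((aligned(16))) float4;

static inline float vec_dot(const float4 *a, const float4 *b)
{
#if defined(__SSE4_1__)
    __m128 na = _mm_load_ps(&a->x);  /* aligned load of x,y,z,w */
    __m128 nb = _mm_load_ps(&b->x);
    /* Mask 0x71: high nibble 0111 multiplies lanes x,y,z only;
     * low nibble 0001 writes the sum to lane 0. */
    return _mm_cvtss_f32(_mm_dp_ps(na, nb, 0x71));
#else
    /* Scalar fallback so the sketch builds without SSE4.1. */
    return a->x * b->x + a->y * b->y + a->z * b->z;
#endif
}
```

This still computes one dot product per call, so, as Brian points out, it should not be expected to gain much over scalar code; the larger gains come from computing several dot products at once on struct-of-arrays data.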