hi Brian
Although I haven't tried this kind of thing with the new SSE4+
instructions, with older instruction sets, in general, using SSE ps
instructions in these cases will actually reduce performance. Even if
you had a float4 type instead of a float3, it's unlikely that you'd
get a speed improvement using structs like this.
SSE, like most other SIMD approaches, works best with a
struct-of-arrays data layout. For a case like the one presented, the
overhead of SSE will simply outweigh its benefits. You might have to
think at a higher algorithmic level to make good use of SSE.
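As a rough illustration of the struct-of-arrays idea mentioned above, here is a minimal sketch. The vec3_soa type and the dot_batch function are hypothetical names chosen for this example and do not appear in the original code; it assumes many dot products are computed in one call.

```c
#include <stddef.h>

/* Hypothetical struct-of-arrays layout: one array per component
   instead of one struct per vector. */
typedef struct {
    const float *x;
    const float *y;
    const float *z;
} vec3_soa;

/* Computes out[i] = dot(a[i], b[i]) for i in [0, n).
   Each iteration is independent of the others, so no sum across
   SIMD lanes is needed inside the loop body. */
void dot_batch(const vec3_soa *a, const vec3_soa *b, float *out, size_t n)
{
    size_t i;
    for (i = 0; i < n; ++i)
        out[i] = a->x[i] * b->x[i]
               + a->y[i] * b->y[i]
               + a->z[i] * b->z[i];
}
```

With this layout, four consecutive x values (and likewise y and z) sit next to each other in memory, which is the shape a 4-wide SSE register wants. Whether a given compiler actually vectorizes the loop is not guaranteed and is worth checking with the vectorizer report.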
Thank you for your comments. I did some tests and found that
what you said was correct.
Here is my SSE3 vec_dot:
#include <pmmintrin.h> /* SSE3 intrinsics, needed for _mm_hadd_ps */
struct CPU_float4{
float x,y,z,w;
};
typedef struct CPU_float4 pvec;
float vec_dot(pvec *a,pvec *b){
float dot;
__m128 na,nb,res;
na=_mm_loadu_ps((float*)a);
nb=_mm_loadu_ps((float*)b);
res=_mm_mul_ps(na,nb);
res=_mm_hadd_ps(res,res);
res=_mm_hadd_ps(res,res);
_mm_store_ss(&dot,res);
return dot;
}
With this function, the run time is about 2.5x slower
than the original code :( (compiled with
gcc -c -Wall -g -O3 -ftree-vectorizer-verbose=2 -DMMC_USE_SSE -msse3).
Now I am trying to use the SSE4.1 DPPS instruction, but gcc gave me
an error. I don't know if I used the wrong operand format.
float vec_dot(float3 *a,float3 *b){
float c;
__asm__ __volatile__
(
"movups (%[a]), %%xmm0 \n\t"
"movups (%[b]), %%xmm1 \n\t"
"dpps %%xmm0, %%xmm1, 0xF1 \n\t"
"movss (%[c]), %%xmm0 \n\t"
: [c] "=m" (c)
: [a] "r" (a), [b] "r" (b)
: "%xmm0", "%xmm1"
);
return c;
}
gcc -c -Wall -g -O3 -ftree-vectorizer-verbose=2 -msse4.1 simpmesh.c
simpmesh.c:56: Error: suffix or operands invalid for `dpps'
simpmesh.c:57: Error: missing ')'
simpmesh.c:57: Error: junk `(%rsp))' after expression
...
did I miss anything obvious in the above code?
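One possible reading of the assembler errors: in the AT&T syntax that gcc's inline asm uses, the immediate comes first and needs a `$` prefix (`dpps $0xF1, %%xmm1, %%xmm0`), and an "=m" operand such as %[c] already expands to a full memory reference, so it is written without extra parentheses and as the destination of the store (`movss %%xmm0, %[c]`). For comparison, here is a minimal sketch of the same dot product using the SSE4.1 intrinsic instead of inline assembly. It assumes the padded four-float struct from the SSE3 version, uses mask 0x71 rather than 0xF1 so that w is ignored (a choice made for this example), and falls back to scalar code when SSE4.1 is not enabled at compile time.

```c
#ifdef __SSE4_1__
#include <smmintrin.h>   /* SSE4.1 intrinsics, provides _mm_dp_ps */
#endif

typedef struct CPU_float4 { float x, y, z, w; } pvec;

float vec_dot(const pvec *a, const pvec *b)
{
#ifdef __SSE4_1__
    /* Unaligned 16-byte loads of x,y,z,w */
    __m128 na = _mm_loadu_ps((const float *)a);
    __m128 nb = _mm_loadu_ps((const float *)b);
    /* Mask 0x71: multiply lanes 0..2 (x,y,z), write the sum to lane 0 */
    __m128 res = _mm_dp_ps(na, nb, 0x71);
    return _mm_cvtss_f32(res);
#else
    /* Scalar fallback when the file is not compiled with -msse4.1 */
    return a->x * b->x + a->y * b->y + a->z * b->z;
#endif
}
```

Using the intrinsic leaves register allocation and operand ordering to the compiler. Note also that loading 16 bytes from a three-float struct would read past its end, which is why the sketch uses the four-float type.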
thanks
Qianqian
Brian
On Thu, Apr 29, 2010 at 10:17 AM, Axel Freyn <axel-freyn@xxxxxx> wrote:
Hi Qianqian,
First: I don't know anything about the vectorizer, so be very careful
with my answer;-)
My code looks like this:
typedef struct CPU_float3{
    float x,y,z;
} float3;
float vec_dot(float3 *a,float3 *b){
        return a->x*b->x+a->y*b->y+a->z*b->z;
}
float pinner(float3 *Pd,float3 *Pm,float3 *Ad,float3 *Am){
        return vec_dot(Pd,Am)+vec_dot(Pm,Ad);
}
...
and then I call pinner() a lot in my main function.
Here are my questions:
1. when I compile the above code with gcc -O3 option, will the
above vec_dot function be translated to SSE automatically?
I think: in general, no. The vectorizer only vectorizes loops.
In addition, you will have to pass "-ffast-math" to the compiler to
permit vectorization (I think?). When you compile your code with the
option "-ftree-vectorizer-verbose=2":
gcc-4.5 -O3 -ffast-math -ftree-vectorizer-verbose=2 -c sse.c
it tells you what the vectorizer is doing: nothing... (I simply
compiled your two functions, vec_dot and pinner.)
However, if you wrote vec_dot as a loop (this assumes the components
are stored as an array member, float x[3], rather than x,y,z):
float vec_dot(float3 *a,float3 *b){
  float dot=0;
  int i;
  for(i = 0; i < 3; ++i)
    dot+= a->x[i]*b->x[i];
  return dot;
}
gcc could vectorize it in principle, but not a loop with only 3 iterations:
sse.c:7: note: not vectorized: iteration count too small.
sse.c:4: note: vectorized 0 loops in function.
However, as soon as you call vec_dot and pinner repeatedly on adjacent
elements, the vectorizer might be able to make use of that... Just
try to compile your code with "-ftree-vectorizer-verbose=2" (and maybe
"-ffast-math", if you can accept the loss of precision / weakening of
the standard (see the man page)).
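The "adjacent elements" case described above might look like the following sketch. The pinner_batch function and its array parameters are hypothetical and only illustrate the loop shape; the struct and the two helper functions follow the code quoted earlier.

```c
#include <stddef.h>

typedef struct CPU_float3 { float x, y, z; } float3;

static float vec_dot(const float3 *a, const float3 *b)
{
    return a->x * b->x + a->y * b->y + a->z * b->z;
}

static float pinner(const float3 *Pd, const float3 *Pm,
                    const float3 *Ad, const float3 *Am)
{
    return vec_dot(Pd, Am) + vec_dot(Pm, Ad);
}

/* Hypothetical driver: evaluates pinner for n adjacent elements.
   The loop gives the vectorizer something to work on; whether it
   actually vectorizes depends on the compiler version and flags. */
void pinner_batch(const float3 *Pd, const float3 *Pm,
                  const float3 *Ad, const float3 *Am,
                  float *out, size_t n)
{
    size_t i;
    for (i = 0; i < n; ++i)
        out[i] = pinner(&Pd[i], &Pm[i], &Ad[i], &Am[i]);
}
```

Compiling this with "-O3 -ftree-vectorizer-verbose=2" should report whether the loop in pinner_batch was vectorized. The three-float struct gives strided access to each component, so the result may well be "not vectorized" or a vectorization that is not profitable.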
2. if not, anyone can suggest a SSE instruction
to accelerate the above computation?
3. is "inline" a valid option for GCC when compiling a C code?
Yes, it is. Moreover, as long as the function is defined in the same
compilation unit where it is used, gcc with -O3 will automatically
inline everything (at least when gcc believes it to be useful :-))
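A minimal sketch of the inline variant, assuming the float3 struct from the question. Declaring the helper "static inline" is a choice made for this example: it behaves the same under gcc's older gnu89 inline semantics and under C99, whereas a plain "inline" without "static" differs between the two.

```c
typedef struct CPU_float3 { float x, y, z; } float3;

/* static inline: private to this compilation unit and a candidate
   for inlining at every call site. */
static inline float vec_dot(const float3 *a, const float3 *b)
{
    return a->x * b->x + a->y * b->y + a->z * b->z;
}

float pinner(const float3 *Pd, const float3 *Pm,
             const float3 *Ad, const float3 *Am)
{
    /* With optimization enabled, both calls are typically expanded
       in place, though the compiler makes the final decision. */
    return vec_dot(Pd, Am) + vec_dot(Pm, Ad);
}
```

The keyword is only a hint; the optimizer still decides whether each call is actually inlined.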
Axel