Re: enabling SSE for 3-vector inner product

Axel Freyn <axel-freyn@xxxxxx> · Thu, 29 Apr 2010 19:17:14 +0200

Hi Qianqian,
>
First: I don't know anything about the vectorizer, so please treat my
answer with caution ;-)
> My code looks like this:
>
> typedef struct CPU_float3{
>     float x,y,z;
> } float3;
> float vec_dot(float3 *a,float3 *b){
>         return a->x*b->x+a->y*b->y+a->z*b->z;
> }
> float pinner(float3 *Pd,float3 *Pm,float3 *Ad,float3 *Am){
>         return vec_dot(Pd,Am)+vec_dot(Pm,Ad);
> }
> ...
>
> and then I call pinner() a lot in my main function.
>
> Here are my questions:
>
> 1. when I compile the above code with gcc -O3 option, will the
> above vec_dot function be translated to SSE automatically?
I think: in general, no. The vectorizer only vectorizes loops.
In addition, you will have to pass "-ffast-math" to the compiler to
permit vectorization here (I think?), since vectorizing a floating-point
sum changes the order of the additions. When you compile your code with
the option "-ftree-vectorizer-verbose=2":

gcc-4.5 -O3 -ffast-math -ftree-vectorizer-verbose=2  -c sse.c

it tells you what the vectorizer is doing: nothing... (I simply
compiled your two functions vec_dot and pinner)

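However, if you wrote vec_dot as a loop (which needs the components
stored as an array, so that they can be indexed):
typedef struct CPU_float3{
    float x[3];
} float3;
float vec_dot(float3 *a,float3 *b){
  float dot=0;
  int i;
  for(i = 0; i < 3; ++i)
    dot += a->x[i]*b->x[i];
  return dot;
}
, gcc could vectorize it in principle, but not a loop with only 3 iterations: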
sse.c:7: note: not vectorized: iteration count too small.
sse.c:4: note: vectorized 0 loops in function.

However, as soon as you call vec_dot and pinner often on adjacent
elements, it might be that the vectorizer can be used for that... Just
try to compile your code with "-ftree-vectorizer-verbose=2" (and maybe
"-ffast-math", if you can accept the loss of precision / weakening of
the standard (see man-page)).
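
To illustrate what "calling it on adjacent elements" could look like, here
is a minimal sketch. The function name vec_dot_many and the layout (one
separate array per component) are my assumptions, not something from your
code. A loop of this shape is the kind of thing the vectorizer looks at;
whether it is actually vectorized on your machine should be checked with
"-ftree-vectorizer-verbose=2".

```c
#include <stddef.h>

/* Hypothetical helper: computes n dot products in one loop.
 * Assumed layout: each component lives in its own array
 * (ax[i], ay[i], az[i] together form the i-th vector a). */
void vec_dot_many(const float *ax, const float *ay, const float *az,
                  const float *bx, const float *by, const float *bz,
                  float *out, size_t n){
    size_t i;
    /* Each iteration is independent of the others, and consecutive
     * iterations touch consecutive memory locations. */
    for(i = 0; i < n; ++i)
        out[i] = ax[i]*bx[i] + ay[i]*by[i] + az[i]*bz[i];
}
```

Note that there is no sum carried across iterations here, so the order of
the floating-point additions inside one dot product stays the same.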
>
> 2. if not, anyone can suggest a SSE instruction
> to accelerate the above computation?
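There is no single SSE instruction for a 3-vector dot product, as far as
I know. A minimal sketch with SSE intrinsics from <xmmintrin.h> could
look like the code below. The name vec_dot_sse is made up for this
example, it assumes an x86 target with SSE enabled, and it pads the unused
fourth lane with 0. Whether it is actually faster than the plain scalar
version would have to be measured.

```c
#include <xmmintrin.h>  /* SSE intrinsics, x86 only */

typedef struct CPU_float3{
    float x,y,z;
} float3;

/* Hypothetical SSE variant of vec_dot; the fourth lane is padded with 0. */
float vec_dot_sse(const float3 *a, const float3 *b){
    /* _mm_set_ps takes its arguments from the highest lane to the lowest */
    __m128 va = _mm_set_ps(0.0f, a->z, a->y, a->x);
    __m128 vb = _mm_set_ps(0.0f, b->z, b->y, b->x);
    __m128 p  = _mm_mul_ps(va, vb);      /* lanes: ax*bx, ay*by, az*bz, 0 */
    __m128 hi = _mm_movehl_ps(p, p);     /* lanes: p2, p3, p2, p3 */
    __m128 s  = _mm_add_ps(p, hi);       /* lane0 = p0+p2, lane1 = p1+p3 */
    /* copy lane1 into every lane, then add it to lane0 */
    __m128 s1 = _mm_shuffle_ps(s, s, _MM_SHUFFLE(1,1,1,1));
    s = _mm_add_ss(s, s1);               /* lane0 = p0+p2+p1+p3 */
    return _mm_cvtss_f32(s);             /* extract lane0 */
}
```

The products are added in a different order than in your scalar vec_dot,
so the last bits of the result may differ.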
>
> 3. is "inline" a valid option for GCC when compiling a C code?
Yes, it is. However, if the function is defined in the same
compilation unit where it is used, gcc with -O3 will automatically
inline it anyway (at least when gcc believes it to be useful :-))
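
If you want to write the hint explicitly, a minimal sketch would be to
mark both functions "static inline". I picked that form because it should
behave the same in gcc's default GNU mode and in C99 mode; with "-ansi"
the plain "inline" keyword is not available. The "const" qualifiers are my
addition.

```c
typedef struct CPU_float3{
    float x,y,z;
} float3;

/* "static inline": visible only in this compilation unit, and a hint
 * to the compiler that inlining is wanted. */
static inline float vec_dot(const float3 *a,const float3 *b){
    return a->x*b->x+a->y*b->y+a->z*b->z;
}

static inline float pinner(const float3 *Pd,const float3 *Pm,
                           const float3 *Ad,const float3 *Am){
    return vec_dot(Pd,Am)+vec_dot(Pm,Ad);
}
```

The keyword is only a hint; gcc still decides for itself whether to
inline the call.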

Axel