Hi list,

I am working on a numerical code and found that a simple inner product of float triplets takes 30% of my run time when compiled with GCC -O3. I want to explore options to speed this up further, and I have a few questions about using SSE with GCC. My code looks like this:

    typedef struct CPU_float3{
        float x,y,z;
    } float3;
    ...
    float vec_dot(float3 *a,float3 *b){
        return a->x*b->x+a->y*b->y+a->z*b->z;
    }
    float pinner(float3 *Pd,float3 *Pm,float3 *Ad,float3 *Am){
        return vec_dot(Pd,Am)+vec_dot(Pm,Ad);
    }
    ...

I then call pinner() many times from my main function. Here are my questions:

1. When I compile the above code with gcc -O3, will vec_dot() be translated to SSE instructions automatically?

2. If not, can anyone suggest SSE instructions to accelerate the above computation?

3. Is "inline" a valid keyword for GCC when compiling C code?

Any suggestions for improving the efficiency are highly appreciated.

Thanks,
Qianqian
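
For questions 2 and 3, here is a minimal sketch of how the two functions might look with SSE intrinsics. It is one possible approach, not a measured improvement: the helper names load_float3, hsum_ps, vec_dot_sse and pinner_sse are made up for illustration, the code assumes an x86 target with SSE enabled (the default on x86-64), and whether it actually beats the scalar version should be checked with a profiler. It uses "static inline", which GCC accepts in C99 and GNU C modes. Because float3 is only 12 bytes, the sketch builds each register with _mm_set_ps rather than a 16-byte load that would read past the end of the struct.

```c
#include <xmmintrin.h>  /* SSE intrinsics: __m128, _mm_set_ps, _mm_mul_ps, ... */

typedef struct CPU_float3 {
    float x, y, z;
} float3;

/* Hypothetical helper: pack a 12-byte float3 into a register as (x, y, z, 0). */
static inline __m128 load_float3(const float3 *v)
{
    return _mm_set_ps(0.0f, v->z, v->y, v->x);
}

/* Hypothetical helper: add the four lanes of v and return the scalar sum. */
static inline float hsum_ps(__m128 v)
{
    __m128 hi    = _mm_movehl_ps(v, v);   /* lanes: (v2, v3, v2, v3) */
    __m128 sum2  = _mm_add_ps(v, hi);     /* lane0 = v0+v2, lane1 = v1+v3 */
    __m128 lane1 = _mm_shuffle_ps(sum2, sum2, _MM_SHUFFLE(1, 1, 1, 1));
    return _mm_cvtss_f32(_mm_add_ss(sum2, lane1));
}

/* SSE variant of vec_dot(): one packed multiply, then a horizontal sum. */
static inline float vec_dot_sse(const float3 *a, const float3 *b)
{
    return hsum_ps(_mm_mul_ps(load_float3(a), load_float3(b)));
}

/* SSE variant of pinner(): the two products are added lane by lane first,
   so only one horizontal sum is needed for both dot products. */
static inline float pinner_sse(const float3 *Pd, const float3 *Pm,
                               const float3 *Ad, const float3 *Am)
{
    __m128 p1 = _mm_mul_ps(load_float3(Pd), load_float3(Am));
    __m128 p2 = _mm_mul_ps(load_float3(Pm), load_float3(Ad));
    return hsum_ps(_mm_add_ps(p1, p2));
}
```

Note that the SSE versions add the terms in a different order than the scalar code, so results can differ in the last bits for general inputs. If the data layout can be changed, padding each triplet to four floats with 16-byte alignment would allow a single aligned load per vector instead of _mm_set_ps, but that is a larger change to the code.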