On 12/14/2010 5:33 AM, windigo84 wrote:
I am using C and the gcc compiler to program some vectorizable loops on
a Nehalem processor. The function parameters are of type const float*
and float*. The problem I have is that gcc performs the loads from these
parameters into the vector registers with these two instructions:
movlps (%rdx), %xmm0
movhps 8(%rdx), %xmm0
instead of with a single:
movups (%rdx), %xmm0 (if unaligned access)
movaps (%rdx), %xmm0 (if aligned access)
I am compiling with the following flags:
-O2 -fexpensive-optimizations -ftree-vectorize -fargument-noalias-global
-msse3 -ftree-vectorizer-verbose=2
I would like to know several things:
1. How can I avoid the load being performed with two move instructions
instead of one?
2. Once gcc performs the loads with only one move instruction, how can I
force it to use only aligned move instructions?
3. Finally (out of the scope of this message), if I use the -O3
optimization flag, gcc is not able to vectorize my loops because they are
fully unrolled before the vectorization pass. The loops iterate from 0 to
8. Is there any way to avoid this?
You're asking gcc to optimize for Core 2 and other CPUs of that era,
where unaligned moves were more expensive than split ones.
If your gcc is so old that it doesn't support -mtune=barcelona, you will
need to upgrade if you want gcc to use movups. That option is suitable
for recent Intel CPUs as well as AMD.
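For instance, a hypothetical compile line (the source file name is made up) adding -mtune=barcelona to the flags from the original post:

```shell
# Hypothetical compile line: the poster's flags plus -mtune=barcelona,
# which makes gcc prefer single unaligned moves (movups) over split ones.
gcc -O2 -ftree-vectorize -msse3 -mtune=barcelona \
    -ftree-vectorizer-verbose=2 -c loops.c
```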
You would need __attribute__((aligned)) and similar gcc extensions to
inform the compiler about alignment so as to enable more use of aligned
instructions. Without those, it's not practical to auto-vectorize short
loops, except by explicit low-level coding (e.g. SSE intrinsics). In any
case, a loop with a trip count of 9 might not be efficiently vectorizable.
--
Tim Prince