On 12/14/2010 5:33 AM, windigo84 wrote:
I am using C and the gcc compiler to program some vectorizable loops on
a Nehalem processor. The function parameters are of type const float*
and float*. The problem I have is that gcc performs the loads from these
parameters into the vector registers with these two instructions:
movlps (%rdx), %xmm0
movhps 8(%rdx), %xmm0
instead of with a single:
movups (%rdx), %xmm0 (if unaligned access)
movaps (%rdx), %xmm0 (if aligned access)
I am compiling with the following flags:
-O2 -fexpensive-optimizations -ftree-vectorize -fargument-noalias-global
-msse3 -ftree-vectorizer-verbose=2
I would like to know several things:
1. How can I avoid the load being performed with two move instructions
instead of one?
2. Once gcc performs the loads with only one move instruction, how can I
force it to use only aligned move instructions?
3. Finally (out of the scope of this message), if I use the -O3
optimization flag, gcc is not able to vectorize my loops because they are
fully unrolled before the vectorization pass. The loops iterate from 0 to
8. Is there any way to avoid this?
You're asking gcc to optimize for Core 2 and other CPUs of that era,
where unaligned moves were more expensive than split ones.
If your gcc is so old that it doesn't support -mtune=barcelona, you will
need to upgrade if you want gcc to use movups. That option is suitable
for recent Intel CPUs as well as AMD.
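For instance, a hypothetical compile line (the source file name is made up) adding -mtune=barcelona to the flags from the original post:

```shell
# Hypothetical compile line: the poster's flags plus -mtune=barcelona,
# which makes gcc prefer single unaligned moves (movups) over split ones.
gcc -O2 -ftree-vectorize -msse3 -mtune=barcelona \
    -ftree-vectorizer-verbose=2 -c loops.c
```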
You would need __attribute__((aligned)) and similar gcc extensions to
inform the compiler about alignment so as to enable more use of aligned
instructions. Without those, it's not practical to auto-vectorize short
loops, except by explicit low-level coding (e.g. SSE intrinsics). In any
case, a loop with a trip count of 9 might not be efficiently vectorizable.
--
Tim Prince