Re: Vector parameter loads to SSE registers


 



Thanks Tim. Now it uses only one move instruction (I am using gcc 4.4.3).
Regarding the alignment problem, I am using this typedef:

typedef float* af __attribute__ ((__aligned__(16)));

But it still uses unaligned move instructions. Perhaps this is because on
Nehalem (with the Barcelona tuning) movups and movaps cost the same number
of cycles when the access is aligned, so the compiler is conservative and
emits movups (a small compile-line sketch based on Tim's suggestion follows
after the quoted message below). Anyway, thanks,

Jandro
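
P.S. In case it is useful to anyone reading the archive, here is a sketch of
the variant I may try next, which puts the aligned attribute on the element
type rather than on the pointer itself (the names are made up for the
example):

typedef float aligned_float __attribute__ ((__aligned__(16)));

/* illustrative only: assumes the arrays really are 16-byte aligned */
void scale8(const aligned_float *in, aligned_float *out, float k)
{
    int i;
    for (i = 0; i < 8; i++)
        out[i] = k * in[i];
}

(On gcc 4.7 and later, which is newer than my 4.4.3, __builtin_assume_aligned
inside the function is another way to state the same thing.)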



Tim Prince-4 wrote:
> 
> On 12/14/2010 5:33 AM, windigo84 wrote:
>> I am using C and the gcc compiler to program some vectorizable loops on a
>> Nehalem processor. The function parameters are of type const float* and
>> float*. The problem is that gcc performs the loads from these parameters
>> into the vector registers with these two instructions:
>>
>> movlps	(%rdx), %xmm0
>> movhps	8(%rdx), %xmm0
>>
>>
>> instead of with:
>>
>> movups (%rdx), %xmm0              (if unaligned access)
>> movaps (%rdx), %xmm0              (if aligned access)
>>
>>
>> I am compiling with the following flags:
>>
>> -O2  -fexpensive-optimizations -ftree-vectorize -fargument-noalias-global
>> -msse3  -ftree-vectorizer-verbose=2
>>
>>
>> I would like to know several things:
>>
>> 1.- How can I avoid having the load performed with two move instructions
>> instead of one?
>> 2.- Once gcc performs the loads with only one move instruction, how can I
>> force it to use only aligned move instructions?
>> 3.- Finally (out of the scope of this message), if I use the -O3
>> optimization flag, gcc is not able to vectorize my loops because they are
>> fully unrolled before the vectorization pass. The loops iterate from 0 to
>> 8. Is there any way to avoid this?
>>
>>
> You're asking gcc to optimize for Core 2 and other CPUs of that era, 
> where unaligned moves were more expensive than split ones.
> If your gcc is so old that it doesn't support -mtune=barcelona, you will 
> need to upgrade if you want gcc to use movups.  That option is suitable 
> for recent Intel CPUs as well as AMD.
> You would need __attribute__((aligned)) and similar gcc extensions to
> inform the compiler about alignment so as to enable more use of aligned
> instructions.  Without those, it's not practical to auto-vectorize short
> loops, except by explicit low-level coding (e.g. SSE intrinsics).  In
> any case, a loop with a trip count of 9 might not be efficiently vectorizable.
> 
> -- 
> 
> Tim Prince
> 
> 
> 
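
For anyone finding this in the archive, a made-up sketch of the kind of loop
and command line being discussed (whether such a short loop actually gets
vectorized depends on the gcc version and the alignment information, as Tim
notes):

void add8(const float *x, const float *y, float *out)
{
    int i;
    for (i = 0; i < 8; i++)    /* short trip count, as in the question */
        out[i] = x[i] + y[i];
}

gcc -O2 -msse3 -mtune=barcelona -ftree-vectorize \
    -ftree-vectorizer-verbose=2 -S add8.c

With a gcc new enough to accept -mtune=barcelona, the expectation is a single
movups for each vector load instead of the movlps/movhps pair, and movaps
only where the compiler can prove 16-byte alignment.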

-- 
View this message in context: http://old.nabble.com/Vector-parameter-loads-to-SSE-registers-tp30453479p30462570.html
Sent from the gcc - Help mailing list archive at Nabble.com.


