Re: Vector parameter loads to SSE registers

Tim Prince <n8tm@xxxxxxx> · Wed, 15 Dec 2010 09:59:26 -0500

On 12/15/2010 5:24 AM, windigo84 wrote:
Thanks Tim. Now it's using only one move instruction (I am using gcc 4.4.3).
About the alignment problem, I am using this typedef:

typedef float* af __attribute__ ((__aligned__(16)));

But it still uses unaligned move instructions. Maybe that is because on
Nehalem architectures (Barcelona tune) the movups and movaps instructions
take the same number of cycles for aligned moves, so the compiler is
conservative and uses movups. Anyway, thanks,

Jandro

Maybe the actual definition of the aligned data has to be visible before
gcc will treat it as aligned. As you say, movaps is no faster than movups
for loads on current CPUs, so the main question is whether the compiler
chooses to generate a remainder loop for alignment. Such a remainder loop
reduces the efficiency of vectorizing short loops.
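A minimal sketch of both points, under stated assumptions. The names N, a,
b, peel_count, add_peeled and add_visible are hypothetical and do not come
from the thread, and the code assumes a 4-byte float. As far as I can tell,
the aligned attribute in the typedef above applies to the pointer type af
itself rather than to the floats it points to, so here the attribute is put
on the array definitions instead. add_peeled is only a hand-written picture
of the loop shape a vectorizer may produce when it cannot prove alignment;
it is not what gcc actually emits.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum { N = 10 };  /* hypothetical length, not from the thread */

/* Alignment attached to the data definitions, visible in this file. */
static float a[N] __attribute__((aligned(16)));
static float b[N] __attribute__((aligned(16)));

/* Number of leading floats to handle before p reaches a 16-byte boundary
   (capped at n). Assumes sizeof(float) == 4. */
static size_t peel_count(const float *p, size_t n)
{
    size_t misalign = (uintptr_t)p % 16;   /* bytes past the boundary */
    size_t peel;

    assert(misalign % sizeof(float) == 0); /* p must be float-aligned */
    peel = misalign ? (16 - misalign) / sizeof(float) : 0;
    return peel < n ? peel : n;
}

/* Scalar stand-in for: alignment loop, 4-wide body, leftover loop. */
static void add_peeled(float *x, const float *y, size_t n)
{
    size_t i = 0;
    size_t peel = peel_count(x, n);

    for (; i < peel; ++i)              /* alignment (peel) loop */
        x[i] += y[i];
    for (; i + 4 <= n; i += 4)         /* stands in for the vector body */
        for (size_t k = 0; k < 4; ++k)
            x[i + k] += y[i + k];
    for (; i < n; ++i)                 /* leftover elements */
        x[i] += y[i];
}

/* Works directly on the visible aligned arrays, so the compiler does not
   have to guess their alignment. */
static void add_visible(void)
{
    for (size_t i = 0; i < N; ++i)
        a[i] += b[i];
}
```

For a loop of only a few iterations, the first and last loops in add_peeled
can account for most of the work, which is the overhead described above.
Compiling with -O3 -S and looking for movaps versus movups in the output for
add_visible is one way to check what a given gcc version does.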

--
Tim Prince