On 12/15/2010 5:24 AM, windigo84 wrote:
Thanks Tim. Now it's using only one move instruction (I am using gcc 4.4.3).
About the alignment problem, I am using this typedef:
typedef float* af __attribute__ ((__aligned__(16)));
But it still uses unaligned move instructions. Maybe that is because on
Nehalem architectures (Barcelona tune) movups and movaps take the same
number of cycles for aligned moves, so the compiler is conservative and
uses movups.
Anyway, thanks,
Jandro
Maybe the actual definition of the aligned data has to be visible before
gcc will take it as aligned. As you say, movaps is no faster for loads
than movups on current CPUs, so the main question is whether the
compiler chooses to generate a remainder loop for alignment. Such a
remainder loop will interfere with the efficiency of short-loop vectorization.
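
A minimal sketch of what "visible definition" could look like, assuming
the data can be defined as arrays in the same translation unit as the
loop. The names a, b, N, is_aligned16, scale_add and demo are made up for
illustration and are not from the original code. Note also that, as far
as I can tell, the typedef quoted above puts the alignment on the pointer
object itself rather than on the floats it points to, which may be part
of why gcc does not treat the data as aligned. Whether gcc 4.4.3 then
emits movaps or skips the alignment loop is something to check in the
generated assembly; this sketch does not guarantee it.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define N 1024

/* The alignment attribute sits on the array definitions themselves,
   and those definitions are visible where the loop is compiled. */
static float a[N] __attribute__((aligned(16)));
static float b[N] __attribute__((aligned(16)));

/* Hypothetical helper: returns 1 if p is 16-byte aligned, else 0. */
static int is_aligned16(const void *p)
{
    return ((uintptr_t)p & 15u) == 0;
}

/* The loop works directly on the arrays above, so the compiler does
   not have to infer alignment through a pointer parameter. */
static void scale_add(float s)
{
    for (size_t i = 0; i < N; ++i)
        a[i] = a[i] + s * b[i];
}

/* Hypothetical driver: fills the arrays, runs the loop once and
   returns the last element of a. */
static float demo(void)
{
    /* Sanity check that the requested alignment was honoured. */
    assert(is_aligned16(a) && is_aligned16(b));

    for (size_t i = 0; i < N; ++i) {
        a[i] = 1.0f;
        b[i] = (float)i;
    }
    scale_add(2.0f);
    return a[N - 1];
}
```

Compiling with something like gcc -O3 -msse2 -S and looking at the inner
loop of scale_add should show whether the alignment was picked up and
whether a separate loop for alignment was generated.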
--
Tim Prince