Re: _mm_malloc()

Tim Prince <n8tm@xxxxxxx> · Wed, 14 Apr 2010 17:02:17 -0700

On 4/14/2010 3:46 PM, nibbs22@xxxxxxx wrote:
Hello, thank you for your time. My name is Kevin.

I am a student micro-benchmarking autovectorization in gcc vs. icc.
My gcc --version reports:
gcc (GCC) 4.4.3 20100127 (Red Hat 4.4.3-4)

The architecture I am testing on is:
processor       : 0
vendor_id       : GenuineIntel
cpu family      : 6
model           : 23
model name      : Intel(R) Core(TM)2 Duo CPU     E8300  @ 2.83GHz
stepping        : 6
cpu MHz         : 2833.148
cache size      : 6144 KB
...

Specifically, I am using example 1 from the website below.
gcc.gnu.org/projects/tree-ssa/vectorization.html

I have contacted the authors of the website and they recommended I 
forward my question to you.

My question concerns the vectorization of code from the site and the 
alignment of pointers returned from _mm_malloc.

I will use an example from the above site, modified slightly:

example1:
// _mm_malloc the arrays
int M = 4*1024*1024;
float *A, *B, *C;

if ((A = (float*) _mm_malloc(M*sizeof(float), 16)) == NULL) {
    fprintf(stderr, "ERROR ALLOCATING mybuffer1\n");
    exit(1);
}
if ((B = (float*) _mm_malloc(M*sizeof(float), 16)) == NULL) {
    fprintf(stderr, "ERROR ALLOCATING mybuffer2\n");
    exit(1);
}
if ((C = (float*) _mm_malloc(M*sizeof(float), 16)) == NULL) {
    fprintf(stderr, "ERROR ALLOCATING mybuffer3\n");
    exit(1);
}
int i;
for (i=0; i<M; i++)  A[i] = B[i] + C[i];

I will also attach the entire compilable file to this email. So you 
can compile it if you wish.

My question is,
When I use
gcc -O3 -msse4 -ftree-vectorizer-verbose=6 example1.c

I get (among a lot of other stuff):

example1.c:75: note: Alignment of access forced using peeling.
example1.c:75: note: Vectorizing an unaligned access.
example1.c:75: note: Vectorizing an unaligned access.

It is my impression that I align all arrays using _mm_malloc.
For clarification: line 75 is the for loop above, and arrays A, B, and 
C are allocated with _mm_malloc().

I thought I was guaranteed to have  _mm_malloc return array addresses 
that are aligned to 16 bytes
and that the compiler would recognize this.

Is that right? If so, then why do I get 'vectorizing an unaligned 
access'?

Thanks for your time :)
Kevin

P.S.
Do you know of a good reference for interpreting the output of gcc's 
-ftree-vectorizer-verbose option?
lincoln> gcc -O3 -msse4 -ftree-vectorizer-verbose=6 example1.c
example1.c:75: note: Alignment of access forced using peeling.
example1.c:75: note: Vectorizing an unaligned access.
example1.c:75: note: Vectorizing an unaligned access.
example1.c:75: note: vect_model_load_cost: unaligned supported by 
hardware.
example1.c:75: note: vect_model_load_cost: inside_cost = 2, 
outside_cost = 0 .
example1.c:75: note: vect_model_load_cost: unaligned supported by 
hardware.
example1.c:75: note: vect_model_load_cost: inside_cost = 2, 
outside_cost = 0 .
example1.c:75: note: vect_model_simple_cost: inside_cost = 1, 
outside_cost = 0 .
example1.c:75: note: vect_model_store_cost: inside_cost = 1, 
outside_cost = 0 .
example1.c:75: note: cost model: prologue peel iters set to vf/2.
example1.c:75: note: cost model: epilogue peel iters set to vf/2 
because peeling for alignment is unknown .
example1.c:75: note: Cost model analysis:
 Vector inside of loop cost: 6
 Vector outside of loop cost: 24
 Scalar iteration cost: 4
 Scalar outside cost: 0
 prologue iterations: 2
 epilogue iterations: 2
 Calculated minimum iters for profitability: 8
example1.c:75: note:   Profitability threshold = 7
example1.c:75: note: Vectorization may not be profitable.
example1.c:75: note: LOOP VECTORIZED.
example1.c:67: note: not vectorized: unhandled data-ref
example1.c:17: note: vectorized 1 loops in function.

There are a lot of terms here that seem to convey a lot of information, 
but I'm not sure how to use them:
inside cost?
outside cost?
prologue peel iters?
epilogue peel iters?
Vector inside of loop cost?
Calculated minimum iters for profitability: 8?

This seems to indicate that if my loop runs at least 8 iterations I 
should see a performance increase.
I iterate over the loop 4 million times, yet the compiler responds with:
"example1.c:75: note: Vectorization may not be profitable."
Just to be precise, line 75 is:   for (i=0; i<M; i++)  A[i] = B[i] + 
C[i];
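
For reference, the reported minimum of 8 is consistent with a simple 
break-even estimate built from the costs in the log. The sketch below 
is an assumption about how those numbers combine, not a statement of 
gcc's documented behavior: the function name and parameters are 
hypothetical, and vf = 4 (four floats per 128-bit vector) is assumed 
from the "prologue peel iters set to vf/2" line together with 
"prologue iterations: 2".

```c
#include <assert.h>

/* Hypothetical helper: smallest iteration count at which the vector
   loop is estimated to beat the scalar loop. Parameter names mirror
   the fields printed in the cost model analysis. */
static int min_profitable_iters(int vec_inside, int vec_outside,
                                int scalar_iter, int scalar_outside,
                                int prologue, int epilogue, int vf)
{
    /* Cost saved by one vector iteration over vf scalar iterations. */
    int gain = scalar_iter * vf - vec_inside;
    assert(gain > 0);

    /* Fixed overhead of the vector version, scaled by vf. */
    int overhead = (vec_outside - scalar_outside) * vf;

    int n = (overhead - vec_inside * (prologue + epilogue)) / gain;

    /* Integer division rounds down; bump n if scalar is still no worse. */
    if (scalar_iter * vf * n <= vec_inside * n + overhead)
        n++;
    return n;
}
```

With the values from the log, min_profitable_iters(6, 24, 4, 0, 2, 2, 4) 
evaluates to 8, matching "Calculated minimum iters for profitability". 
The reported threshold of 7 is one less than that figure.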

Thanks again if you can spend just a few minutes commenting on my query.
Thanks.

Apparently, the compiler doesn't trust _mm_malloc to return an aligned 
pointer.  So it generates code to check at run time whether A is 
aligned; if not, it will "peel" enough scalar iterations to reach an 
aligned address.  Then, not being certain that B or C are aligned, it 
provides for unaligned access to them.  I believe your CPU is a 
Penryn-style part, where you could force unaligned 128-bit loads, 
rather than split loads, with -mtune=barcelona and perhaps see better 
performance, as your data should be aligned in practice.  The barcelona 
option may also cancel the remark that vectorization "may not be 
profitable."  With such a large iteration count, performance would be 
improved by non-temporal stores to A.  I guess you are using the x86_64 
compiler, where the default value for -march is reasonable (I don't 
know whether that version of 32-bit gcc would default to 
-march=native).
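
One way to state the alignment explicitly is sketched below. It relies 
on __builtin_assume_aligned, which was added in gcc 4.7 and so is not 
available in the 4.4.3 compiler used here; treat it as an option for 
newer compilers only. The function names are hypothetical, and C11 
aligned_alloc stands in for _mm_malloc so that the sketch builds 
without the SSE headers.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical helper: allocate n floats on a 16-byte boundary.
   aligned_alloc (C11) is used in place of _mm_malloc; n * sizeof(float)
   should be a multiple of 16. Release the memory with free(). */
static float *alloc_floats(size_t n)
{
    return (float *)aligned_alloc(16, n * sizeof(float));
}

/* Computes a[i] = b[i] + c[i] while promising the compiler that all
   three pointers are 16-byte aligned and do not overlap. */
static void add_aligned(float *restrict a, const float *restrict b,
                        const float *restrict c, size_t n)
{
    /* Verify the promise in debug builds; breaking it is undefined. */
    assert((uintptr_t)a % 16 == 0);
    assert((uintptr_t)b % 16 == 0);
    assert((uintptr_t)c % 16 == 0);

    float *pa = __builtin_assume_aligned(a, 16);
    const float *pb = __builtin_assume_aligned(b, 16);
    const float *pc = __builtin_assume_aligned(c, 16);

    for (size_t i = 0; i < n; i++)
        pa[i] = pb[i] + pc[i];
}
```

Whether this removes the peeling and the "unaligned access" notes 
depends on the compiler version and flags, so it is worth checking 
the vectorizer report again after the change.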

--
Tim Prince