Re: _mm_malloc()

On 4/14/2010 3:46 PM, nibbs22@xxxxxxx wrote:
Hello, thank you for your time. My name is Kevin.

I am a student running micro-benchmarks comparing gcc and icc autovectorization.
I am using gcc --version
gcc (GCC) 4.4.3 20100127 (Red Hat 4.4.3-4)

The architecture I am testing on is:
processor       : 0
vendor_id       : GenuineIntel
cpu family      : 6
model           : 23
model name      : Intel(R) Core(TM)2 Duo CPU     E8300  @ 2.83GHz
stepping        : 6
cpu MHz         : 2833.148
cache size      : 6144 KB
...

I am specifically using example 1 from the website below.
gcc.gnu.org/projects/tree-ssa/vectorization.html

I have contacted the authors of the website and they recommended I forward my question to you.

My question concerns the vectorization of code from the site and the alignment of
pointers returned from _mm_malloc.

I will use an example from the above site, modified slightly:

example1:
// _mm_malloc the arrays
int M = 4*1024*1024;

if ((A = (float*) _mm_malloc(M*sizeof(float), 16)) == NULL) {
    printf("ERROR ALLOCATING mybuffer1\n");
    exit(1);
}

if ((B = (float*) _mm_malloc(M*sizeof(float), 16)) == NULL) {
    printf("ERROR ALLOCATING mybuffer2\n");
    fflush(stderr);
    exit(1);
}

if ((C = (float*) _mm_malloc(M*sizeof(float), 16)) == NULL) {
    printf("ERROR ALLOCATING mybuffer3\n");
    exit(1);
}

int i;
for (i=0; i<M; i++)  A[i] = B[i] + C[i];

I will also attach the entire compilable file to this email, so you can compile it if you wish.

My question is,
When I use
gcc -O3 -msse4 -ftree-vectorizer-verbose=6 example1.c


I get (among a lot of other stuff):

example1.c:75: note: Alignment of access forced using peeling.
example1.c:75: note: Vectorizing an unaligned access.
example1.c:75: note: Vectorizing an unaligned access.



It is my understanding that all three arrays are aligned, since I allocate them with _mm_malloc.
For clarification: line 75 is the for loop above, and arrays A, B, and C are allocated with _mm_malloc().

I thought I was guaranteed to have _mm_malloc return array addresses that are aligned to 16 bytes
and that the compiler would recognize this.

Is that right? If so, then why do I get 'vectorizing an unaligned access'?

Thanks for your time :)
Kevin

P.S.
Do you have a good source on how to interpret the output of -ftree-vectorizer-verbose?
lincoln> gcc -O3 -msse4 -ftree-vectorizer-verbose=6 example1.c
example1.c:75: note: Alignment of access forced using peeling.
example1.c:75: note: Vectorizing an unaligned access.
example1.c:75: note: Vectorizing an unaligned access.
example1.c:75: note: vect_model_load_cost: unaligned supported by hardware.
example1.c:75: note: vect_model_load_cost: inside_cost = 2, outside_cost = 0 .
example1.c:75: note: vect_model_load_cost: unaligned supported by hardware.
example1.c:75: note: vect_model_load_cost: inside_cost = 2, outside_cost = 0 .
example1.c:75: note: vect_model_simple_cost: inside_cost = 1, outside_cost = 0 .
example1.c:75: note: vect_model_store_cost: inside_cost = 1, outside_cost = 0 .
example1.c:75: note: cost model: prologue peel iters set to vf/2.
example1.c:75: note: cost model: epilogue peel iters set to vf/2 because peeling for alignment is unknown .
example1.c:75: note: Cost model analysis:
 Vector inside of loop cost: 6
 Vector outside of loop cost: 24
 Scalar iteration cost: 4
 Scalar outside cost: 0
 prologue iterations: 2
 epilogue iterations: 2
 Calculated minimum iters for profitability: 8
example1.c:75: note:   Profitability threshold = 7
example1.c:75: note: Vectorization may not be profitable.
example1.c:75: note: LOOP VECTORIZED.
example1.c:67: note: not vectorized: unhandled data-ref
example1.c:17: note: vectorized 1 loops in function.


There are a lot of terms that seem to convey useful information, but I am not sure how to use them:
inside cost?
outside cost?
prologue peel iters?
epilogue peel iters?
Vector inside of loop cost?
Calculated minimum iters for profitability: 8?

This seems to indicate that if my loop runs for at least 8 iterations I should see a performance increase.
I iterate over the loop 4 million times, yet the compiler responds with:
"example1.c:75: note: Vectorization may not be profitable."
Just to be precise, line 75 is: for (i=0; i<M; i++) A[i] = B[i] + C[i];
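
The numbers in the cost-model dump can be tied together with a little arithmetic. The sketch below is an assumption about how the 4.4-era vectorizer combines them; it is not something the dump states. The struct and function names are hypothetical, and the vectorization factor of 4 is assumed from four floats per 128-bit SSE register. It does reproduce the reported minimum of 8 iterations and the threshold of 7.

```c
/* Hypothetical container for the figures printed under "Cost model analysis". */
struct vect_costs {
    int vec_inside;      /* "Vector inside of loop cost"  (per vector iteration) */
    int vec_outside;     /* "Vector outside of loop cost" (paid once per loop)   */
    int scalar_iter;     /* "Scalar iteration cost"                              */
    int scalar_outside;  /* "Scalar outside cost"                                */
    int prologue_iters;  /* scalar iterations peeled before the vector loop      */
    int epilogue_iters;  /* scalar iterations left over after the vector loop    */
    int vf;              /* vectorization factor (assumed: 4 floats per vector)  */
};

/* Smallest trip count at which the vector version is estimated to win,
   or -1 if one vector iteration is no cheaper than vf scalar iterations. */
int min_profitable_iters(const struct vect_costs *c)
{
    int gain = c->scalar_iter * c->vf - c->vec_inside;
    if (gain <= 0)
        return -1;
    if (c->vec_outside <= 0)
        return 1;

    int extra = (c->vec_outside - c->scalar_outside) * c->vf;
    int n = (extra - c->vec_inside * (c->prologue_iters + c->epilogue_iters)) / gain;

    /* Integer division rounds down, so bump n if the vector loop does not yet win. */
    if (c->scalar_iter * c->vf * n <= c->vec_inside * n + extra)
        n++;
    return n;
}

/* Trip counts less than or equal to this value would skip the vector loop. */
int profitability_threshold(const struct vect_costs *c)
{
    int n = min_profitable_iters(c);
    if (n < 0)
        return -1;
    if (n < c->vf)
        n = c->vf;   /* at least one full vector iteration is needed */
    return n - 1;
}
```

With the values from the dump (vector inside 6, vector outside 24, scalar iteration 4, scalar outside 0, prologue and epilogue 2 each), min_profitable_iters gives 8 and profitability_threshold gives 7. Under this reading, "inside" costs are paid on every vector iteration and "outside" costs once per loop, so a loop with 4 million iterations is far beyond the break-even point.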


Thanks again if you can spend just a few minutes commenting on my query.
Thanks.

Apparently, the compiler doesn't trust _mm_malloc to return an aligned pointer. So it generates code to check at run time whether A is aligned; if not, it will "peel" enough scalar iterations to reach an aligned address. Then, not being certain that B or C are aligned, it provides for unaligned access to them.

I believe your CPU is a Penryn style, where you could force unaligned 128-bit loads, rather than split loads, with -mtune=barcelona and perhaps see better performance, as your data should be aligned in practice. The barcelona option may also cancel the remarks about "may not be profitable." With such a large iteration count, performance would be improved by non-temporal stores to A.

I guess you may be using the x86_64 compiler, where the default value for -march is reasonable (I don't know whether that version of 32-bit gcc would default to -march=native).
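
One way to hand the compiler the alignment guarantee it cannot infer is sketched below. It relies on __builtin_assume_aligned, which was added in GCC 4.7 and is therefore not available in the 4.4.3 compiler discussed in this thread; treat it as an illustration for newer compilers only. The helper names add_aligned, alloc16 and is_aligned16 are hypothetical, and posix_memalign stands in for _mm_malloc so the sketch does not depend on x86 headers.

```c
#include <stdint.h>
#include <stdlib.h>

/* a[i] = b[i] + c[i]. The builtin promises the compiler that each pointer is
   16-byte aligned; passing a misaligned pointer is undefined behavior. */
void add_aligned(float *restrict a, const float *restrict b,
                 const float *restrict c, size_t n)
{
    float *pa = __builtin_assume_aligned(a, 16);
    const float *pb = __builtin_assume_aligned(b, 16);
    const float *pc = __builtin_assume_aligned(c, 16);

    for (size_t i = 0; i < n; i++)
        pa[i] = pb[i] + pc[i];
}

/* Hypothetical helper: returns n floats aligned to 16 bytes, or NULL on failure.
   Memory from posix_memalign is released with free(). */
float *alloc16(size_t n)
{
    void *p = NULL;
    if (posix_memalign(&p, 16, n * sizeof(float)) != 0)
        return NULL;
    return p;
}

/* Run-time check equivalent to the test the vectorizer emits before peeling. */
int is_aligned16(const void *p)
{
    return ((uintptr_t)p % 16) == 0;
}
```

Whether the "Vectorizing an unaligned access" notes disappear with this change depends on the compiler version and flags, so the vectorizer report is worth re-checking after applying it.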

--
Tim Prince
