On 4/14/2010 3:46 PM, nibbs22@xxxxxxx wrote:
Hello, thank you for your time. My name is Kevin.
I am a student running micro-benchmarks of gcc vs. icc
autovectorization.
I am using gcc; gcc --version reports:
gcc (GCC) 4.4.3 20100127 (Red Hat 4.4.3-4)
The architecture I am testing on is:
processor : 0
vendor_id : GenuineIntel
cpu family : 6
model : 23
model name : Intel(R) Core(TM)2 Duo CPU E8300 @ 2.83GHz
stepping : 6
cpu MHz : 2833.148
cache size : 6144 KB
...
Specifically, I am using example 1 from the website below.
gcc.gnu.org/projects/tree-ssa/vectorization.html
I have contacted the authors of the website and they recommended I
forward my question to you.
My question concerns the vectorization of code from the site and the
alignment of pointers returned by _mm_malloc.
I will use an example from the above site, modified slightly:
example1:
// _mm_malloc the arrays, each aligned to 16 bytes
int M = 4*1024*1024;
if (( A = (float*) _mm_malloc( M*sizeof(float),16) ) == NULL) {
    printf("ERROR ALLOCATING mybuffer1\n");
    exit(1); }
if (( B = (float*) _mm_malloc( M*sizeof(float),16) ) == NULL) {
    printf("ERROR ALLOCATING mybuffer2\n");
    fflush(stderr);
    exit(1); }
if (( C = (float*) _mm_malloc( M*sizeof(float),16) ) == NULL) {
    printf("ERROR ALLOCATING mybuffer3\n");
    exit(1); }
int i;
// the loop reported as line 75
for (i=0; i<M; i++) A[i] = B[i] + C[i];
I will also attach the entire compilable file to this email, so you
can compile it if you wish.
My question is,
When I use
gcc -O3 -msse4 -ftree-vectorizer-verbose=6 example1.c
I get (among a lot of other stuff):
example1.c:75: note: Alignment of access forced using peeling.
example1.c:75: note: Vectorizing an unaligned access.
example1.c:75: note: Vectorizing an unaligned access.
It is my impression that I align all arrays using _mm_malloc.
For clarification: line 75 is the for loop above, and arrays A, B, and
C are allocated with _mm_malloc().
I thought I was guaranteed to have _mm_malloc return array addresses
that are aligned to 16 bytes
and that the compiler would recognize this.
Is that right? If so, then why do I get 'vectorizing an unaligned
access'?
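Whether the pointers really are 16-byte aligned at run time can be
checked independently of what the compiler is able to prove. The
following is a minimal sketch; is_aligned is a hypothetical helper and
is not part of example1.c.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not part of example1.c): returns 1 when p is a
   multiple of align bytes, 0 otherwise. align must be non-zero. */
int is_aligned(const void *p, size_t align)
{
    assert(align != 0);
    return ((uintptr_t)p % align) == 0;
}
```

For instance, is_aligned(A, 16) could be tested right after each
allocation. A passing check only says something about that run; it does
not give the vectorizer any compile-time alignment information.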
Thanks for your time :)
Kevin
P.S.
Do you have a good source on how to interpret the information that
comes back from gcc's -ftree-vectorizer-verbose?
lincoln> gcc -O3 -msse4 -ftree-vectorizer-verbose=6 example1.c
example1.c:75: note: Alignment of access forced using peeling.
example1.c:75: note: Vectorizing an unaligned access.
example1.c:75: note: Vectorizing an unaligned access.
example1.c:75: note: vect_model_load_cost: unaligned supported by hardware.
example1.c:75: note: vect_model_load_cost: inside_cost = 2, outside_cost = 0 .
example1.c:75: note: vect_model_load_cost: unaligned supported by hardware.
example1.c:75: note: vect_model_load_cost: inside_cost = 2, outside_cost = 0 .
example1.c:75: note: vect_model_simple_cost: inside_cost = 1, outside_cost = 0 .
example1.c:75: note: vect_model_store_cost: inside_cost = 1, outside_cost = 0 .
example1.c:75: note: cost model: prologue peel iters set to vf/2.
example1.c:75: note: cost model: epilogue peel iters set to vf/2 because peeling for alignment is unknown .
example1.c:75: note: Cost model analysis:
Vector inside of loop cost: 6
Vector outside of loop cost: 24
Scalar iteration cost: 4
Scalar outside cost: 0
prologue iterations: 2
epilogue iterations: 2
Calculated minimum iters for profitability: 8
example1.c:75: note: Profitability threshold = 7
example1.c:75: note: Vectorization may not be profitable.
example1.c:75: note: LOOP VECTORIZED.
example1.c:67: note: not vectorized: unhandled data-ref
example1.c:17: note: vectorized 1 loops in function.
There are a lot of terms here that seem to convey a lot of information,
but I'm not sure how to use them:
inside cost?
outside cost?
prologue peel iters?
epilogue peel iters?
Vector inside of loop cost?
Calculated minimum iters for profitability: 8?
This seems to indicate that if my loop runs for at least 8 iterations
I should see a performance increase.
I iterate over the loop 4 million times, yet the compiler responds with:
"example1.c:75: note: Vectorization may not be profitable."
Just to be precise, line 75 is:
for (i=0; i<M; i++) A[i] = B[i] + C[i];
Thanks again if you can spend just a few minutes commenting on my query.
Thanks.
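The figure "Calculated minimum iters for profitability: 8" can be
reproduced from the other numbers in the log with a simple break-even
calculation. The sketch below is a reconstruction, not taken from the
GCC documentation. It assumes a vectorization factor of 4 (four floats
per 128-bit vector, consistent with vf/2 = 2 peel iterations) and
assumes that the scalar cost of n iterations is
scalar_outside + scalar_iter * n, while the vector cost is
vec_outside + vec_inside * (n - prologue - epilogue) / vf. The function
name is hypothetical.

```c
#include <assert.h>

/* Hypothetical helper: smallest iteration count n at which the assumed
   vector cost drops below the assumed scalar cost. Returns -1 when the
   vector loop never wins under this model. */
int min_profitable_iters(int vec_inside, int vec_outside,
                         int scalar_iter, int scalar_outside,
                         int prologue, int epilogue, int vf)
{
    assert(vf > 0);

    /* Per-iteration saving of the vector loop, scaled by vf. */
    int denom = scalar_iter * vf - vec_inside;
    if (denom <= 0)
        return -1;

    /* Fixed overhead of the vector version, scaled by vf, minus the
       vector-body cost not incurred for the peeled iterations. */
    int numer = (vec_outside - scalar_outside) * vf
              - vec_inside * (prologue + epilogue);

    int n = numer / denom;  /* integer division truncates */

    /* Step up by one when the scalar loop is still no more expensive
       at n, so the result is the first count where vector code wins. */
    if (scalar_iter * vf * n <= vec_inside * n + numer)
        n++;
    return n;
}
```

With the logged values (vector inside cost 6, vector outside cost 24,
scalar iteration cost 4, scalar outside cost 0, 2 prologue and 2
epilogue iterations), min_profitable_iters(6, 24, 4, 0, 2, 2, 4) gives
8: at 7 iterations the scalar loop costs 28 against 28.5 for the vector
version, and at 8 it costs 32 against 30.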
Apparently, the compiler doesn't trust _mm_malloc to return an aligned
pointer. So it generates code to check at run time whether A is
aligned; if not, it will "peel" enough scalar iterations to reach an
aligned address. Then, not being certain that B or C is aligned, it
provides for unaligned access to them. I believe your CPU is a
Penryn-style part, where you could force unaligned 128-bit loads,
rather than split loads, with -mtune=barcelona and perhaps see better
performance, since your data should be aligned in practice. The
barcelona option may also cancel the remarks about "may not be
profitable." With such a large iteration count, performance would be
improved by non-temporal stores to A. I guess you may be using the
x86_64 compiler, where the default value for -march is reasonable (I
don't know whether that version of 32-bit gcc would default to
-march=native).
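The peeling strategy described above can be written out by hand in
plain C. This is only an illustrative sketch of the loop structure, not
the code gcc actually emits: add_peeled, VEC_BYTES and VF are
hypothetical names, the vector width is assumed to be 16 bytes (four
floats), and the inner k loop stands in for a single vector operation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum { VEC_BYTES = 16, VF = 4 };  /* assumed: 128-bit vectors of float */

/* Hypothetical hand-written equivalent of a[i] = b[i] + c[i] with
   peeling for the alignment of the store target a. */
void add_peeled(float *a, const float *b, const float *c, size_t n)
{
    size_t i = 0;

    /* Prologue: scalar iterations until a + i is 16-byte aligned. */
    while (i < n && ((uintptr_t)(a + i) % VEC_BYTES) != 0) {
        a[i] = b[i] + c[i];
        i++;
    }

    /* Main body: VF elements per step. The store to a is aligned here;
       b and c may or may not be, so their loads are treated as
       unaligned. */
    for (; i + VF <= n; i += VF) {
        assert(((uintptr_t)(a + i) % VEC_BYTES) == 0);
        for (size_t k = 0; k < VF; k++)
            a[i + k] = b[i + k] + c[i + k];
    }

    /* Epilogue: leftover scalar iterations. */
    for (; i < n; i++)
        a[i] = b[i] + c[i];
}
```

When a is already aligned, as it should be for a pointer returned by
_mm_malloc(..., 16), the prologue loop runs zero times and only its
run-time check remains.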
--
Tim Prince