_mm_malloc()

nibbs22@xxxxxxx · Wed, 14 Apr 2010 18:46:22 -0400

Hello, thank you for your time. My name is Kevin.

I am a student running micro-benchmarks that compare gcc and icc 
autovectorization.
My compiler, as reported by gcc --version, is:
gcc (GCC) 4.4.3 20100127 (Red Hat 4.4.3-4)

The architecture I am testing on is:
processor       : 0
vendor_id       : GenuineIntel
cpu family      : 6
model           : 23
model name      : Intel(R) Core(TM)2 Duo CPU     E8300  @ 2.83GHz
stepping        : 6
cpu MHz         : 2833.148
cache size      : 6144 KB
...

I am specifically using example 1 from the website below:
gcc.gnu.org/projects/tree-ssa/vectorization.html

I have contacted the authors of the website and they recommended I 
forward my question to you.

My question concerns the vectorization of code from that site and the 
alignment of pointers returned from _mm_malloc.

I will use an example from the above site, modified slightly:

example1:
// allocate the arrays with _mm_malloc
int M = 4*1024*1024;

if ((A = (float*) _mm_malloc(M*sizeof(float), 16)) == NULL) {
  printf("ERROR ALLOCATING mybuffer1\n");
  exit(1);
}
if ((B = (float*) _mm_malloc(M*sizeof(float), 16)) == NULL) {
  printf("ERROR ALLOCATING mybuffer2\n");
  fflush(stderr);
  exit(1);
}
if ((C = (float*) _mm_malloc(M*sizeof(float), 16)) == NULL) {
  printf("ERROR ALLOCATING mybuffer3\n");
  exit(1);
}
int i;
for (i=0; i<M; i++)  A[i] = B[i] + C[i];

I will also attach the entire compilable file to this email. So you can 
compile it if you wish.

My question is,
When I use
gcc -O3 -msse4 -ftree-vectorizer-verbose=6 example1.c

I get (among a lot of other stuff):

example1.c:75: note: Alignment of access forced using peeling.
example1.c:75: note: Vectorizing an unaligned access.
example1.c:75: note: Vectorizing an unaligned access.

It is my impression that I align all arrays by using _mm_malloc.
For clarification: line 75 is the for loop above, and arrays A, B, and C 
are allocated with _mm_malloc().
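
The runtime side of that impression can be checked directly. The sketch 
below is not part of example1.c; is_aligned and mm_malloc_is_16_aligned 
are hypothetical helper names introduced only for illustration. It tests 
whether a pointer returned by _mm_malloc is a multiple of 16 bytes. It 
says nothing about what the compiler knows at compile time.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <xmmintrin.h>  /* _mm_malloc, _mm_free */

/* Hypothetical helper: returns 1 if p is a multiple of `alignment`
   bytes, 0 otherwise. `alignment` must be a power of two. */
static int is_aligned(const void *p, size_t alignment)
{
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    return ((uintptr_t)p & (alignment - 1)) == 0;
}

/* Hypothetical helper: allocates n floats with a requested alignment
   of 16 bytes and reports whether the returned address honours it.
   Returns 1 if aligned, 0 if not, -1 if the allocation failed. */
static int mm_malloc_is_16_aligned(size_t n)
{
    float *p = (float *)_mm_malloc(n * sizeof(float), 16);
    if (p == NULL)
        return -1;
    int ok = is_aligned(p, 16);
    _mm_free(p);
    return ok;
}
```

Calling mm_malloc_is_16_aligned(4*1024*1024) from main would report on an 
allocation of the same size as A, B, and C.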

I thought _mm_malloc was guaranteed to return addresses that are 
aligned to 16 bytes, and that the compiler would recognize this.

Is that right? If so, then why do I get 'vectorizing an unaligned 
access'?
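
One way to state the alignment explicitly to the compiler is sketched 
below, as an illustration rather than a confirmed fix for the output 
above. It relies on __builtin_assume_aligned, which is available in GCC 
4.7 and later (and in Clang), so it is an assumption that a newer 
compiler than the gcc 4.4.3 used here is at hand. The function name 
add_aligned is hypothetical. Note also that __attribute__((aligned(16))) 
on a pointer variable aligns the storage of the pointer itself, not the 
memory it points to.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: a[i] = b[i] + c[i] for i in [0, n).
   The caller promises that a, b, and c are 16-byte aligned;
   __builtin_assume_aligned (GCC 4.7+, Clang) hands that promise
   to the optimizer. Passing a misaligned pointer is undefined. */
static void add_aligned(float *__restrict__ a,
                        const float *__restrict__ b,
                        const float *__restrict__ c,
                        size_t n)
{
    assert(n == 0 || (a != NULL && b != NULL && c != NULL));

    float *pa = __builtin_assume_aligned(a, 16);
    const float *pb = __builtin_assume_aligned(b, 16);
    const float *pc = __builtin_assume_aligned(c, 16);

    for (size_t i = 0; i < n; i++)
        pa[i] = pb[i] + pc[i];
}
```

Whether this removes the "Vectorizing an unaligned access" notes was not 
tested here.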

Thanks for your time :)
Kevin

P.S.
Do you have a good reference for interpreting the output of 
-ftree-vectorizer-verbose? For example, this is the full output I get:
lincoln> gcc -O3 -msse4 -ftree-vectorizer-verbose=6 example1.c
example1.c:75: note: Alignment of access forced using peeling.
example1.c:75: note: Vectorizing an unaligned access.
example1.c:75: note: Vectorizing an unaligned access.
example1.c:75: note: vect_model_load_cost: unaligned supported by hardware.
example1.c:75: note: vect_model_load_cost: inside_cost = 2, outside_cost = 0 .
example1.c:75: note: vect_model_load_cost: unaligned supported by hardware.
example1.c:75: note: vect_model_load_cost: inside_cost = 2, outside_cost = 0 .
example1.c:75: note: vect_model_simple_cost: inside_cost = 1, outside_cost = 0 .
example1.c:75: note: vect_model_store_cost: inside_cost = 1, outside_cost = 0 .
example1.c:75: note: cost model: prologue peel iters set to vf/2.
example1.c:75: note: cost model: epilogue peel iters set to vf/2 because peeling for alignment is unknown .
example1.c:75: note: Cost model analysis:
 Vector inside of loop cost: 6
 Vector outside of loop cost: 24
 Scalar iteration cost: 4
 Scalar outside cost: 0
 prologue iterations: 2
 epilogue iterations: 2
 Calculated minimum iters for profitability: 8
example1.c:75: note:   Profitability threshold = 7
example1.c:75: note: Vectorization may not be profitable.
example1.c:75: note: LOOP VECTORIZED.
example1.c:67: note: not vectorized: unhandled data-ref
example1.c:17: note: vectorized 1 loops in function.

There are a lot of terms here that seem to convey a lot of information, 
but I'm not sure how to use them:
inside cost?
outside cost?
prologue peel iters?
epilogue peel iters?
Vector inside of loop cost?
Calculated minimum iters for profitability: 8?

This seems to indicate that if the loop runs at least 8 times, I should 
see a performance increase.
My loop runs 4 million times, yet the compiler responds with:
"example1.c:75: note: Vectorization may not be profitable."
Just to be precise, line 75 is:
for (i=0; i<M; i++)  A[i] = B[i] + C[i];
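
The figures in the "Cost model analysis" block can be related to each 
other with a simple break-even check. The sketch below is an assumption 
about how the numbers fit together, not code taken from the GCC sources: 
it looks for the smallest iteration count n at which the scalar cost 
(scalar iteration cost * n + scalar outside cost) exceeds the vector cost 
(vector inside cost * (n - prologue - epilogue) / vf + vector outside 
cost). The vectorization factor vf = 4 is also an assumption (four floats 
per 128-bit SSE register), consistent with "prologue peel iters set to 
vf/2" and "prologue iterations: 2". The names vect_costs and 
break_even_iters are hypothetical.

```c
#include <assert.h>

/* Values copied from the "Cost model analysis" lines of the log. */
struct vect_costs {
    int vec_inside;     /* Vector inside of loop cost  */
    int vec_outside;    /* Vector outside of loop cost */
    int scalar_iter;    /* Scalar iteration cost       */
    int scalar_outside; /* Scalar outside cost         */
    int prologue;       /* prologue iterations         */
    int epilogue;       /* epilogue iterations         */
    int vf;             /* assumed vectorization factor */
};

/* Smallest n in [1, limit] for which the scalar cost exceeds the
   vector cost. Both sides are multiplied by vf so that the
   comparison stays in integers. Returns -1 if no such n exists. */
static int break_even_iters(struct vect_costs c, int limit)
{
    assert(c.vf > 0);
    for (int n = 1; n <= limit; n++) {
        long scalar = ((long)c.scalar_iter * n + c.scalar_outside) * c.vf;
        long vector = (long)c.vec_inside * (n - c.prologue - c.epilogue)
                    + (long)c.vec_outside * c.vf;
        if (scalar > vector)
            return n;
    }
    return -1;
}
```

With the values from the log (6, 24, 4, 0, 2, 2 and vf = 4) the function 
returns 8, which agrees with the "Calculated minimum iters for 
profitability: 8" line.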

Thanks again if you can spend just a few minutes commenting on my query.
Thanks.
#include <xmmintrin.h>
#include <stdio.h>

#include <stdlib.h>
#include <math.h>
#include <time.h>

#include <stdint.h>

// forward declaration (takes no arguments)

float poor_random_float(void);

int main(int argc, char* argv[]) {

  //allocate some pointers that will be large float arrays.

  // don't really need to align at this point because _mm_malloc should take care of that?

  float*  __restrict__ A __attribute__ ((aligned(16)));
  float*  __restrict__ B __attribute__ ((aligned(16)));
  float*  __restrict__ C __attribute__ ((aligned(16)));

  float*  __restrict__ D __attribute__ ((aligned(16)));
  float  E[4] __attribute__ ((aligned(16)));

  //float*  A __attribute__ ((aligned(16)));
  //float*  B __attribute__ ((aligned(16)));
  //float*  C __attribute__ ((aligned(16)));

  //float*  A;
  //float*  B;
  //float*  C;

  const int M= 4*1024*1024;
  //const int N= 4*1024*1023;

  if (( A  = (float*) _mm_malloc( M*sizeof(float),16) ) == NULL) {
  printf("ERROR ALLOCATING mybuffer1\n");
  exit(1);
  }
  if (( B   = (float*) _mm_malloc( M*sizeof(float),16) ) == NULL) {
  printf("ERROR ALLOCATING mybuffer2\n");
  fflush(stderr);
  exit(1);
  }
  if (( C = (float*) _mm_malloc( M*sizeof(float),16) ) == NULL) {
  printf("ERROR ALLOCATING mybuffer3\n");
  exit(1);
  }
  if (( D = (float*) _mm_malloc( 4*sizeof(float),16) ) == NULL) {
  printf("ERROR ALLOCATING mybuffer4\n");
  exit(1);
  }

  // fill the arrays with garbage.

  int i;

  for (i=0;i<M;i++){
    B[i] = poor_random_float();
    C[i] = poor_random_float();
  }

  for (i=0; i<M; i++){
    A[i] = B[i] + C[i];
  }

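  // release the buffers obtained from _mm_malloc
  _mm_free(A);
  _mm_free(B);
  _mm_free(C);
  _mm_free(D);

  return 0;
}// END MAIN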

float poor_random_float()
{
  // generate a pseudo-random float in the range [1.0, 2.0] from rand()
  int x;
  float f;

  x = rand();
  f = ((float) x / (float)RAND_MAX) + 1.0f;

  return f;
}