Re: Loop Vectorization and OpenMP

Tim Prince <n8tm@xxxxxxx> · Mon, 14 Jan 2013 10:04:59 -0600

On 1/14/2013 9:33 AM, Freddie Witherden wrote:
Hi all,

I have a function which I wish to accelerate with auto-vectorization and
OpenMP:

void fn(float *restrict rho_in,     float *restrict E_in,
         float *restrict rhou_in,    float *restrict rhov_in,
         float *restrict f0rho_out,  float *restrict f0E_out,
         float *restrict f0rhou_out, float *restrict f0rhov_out,
         float *restrict f1rho_out,  float *restrict f1E_out,
         float *restrict f1rhou_out, float *restrict f1rhov_out,
         int n)
{
     rho_in  = (float *) __builtin_assume_aligned(rho_in, 32);
     E_in    = (float *) __builtin_assume_aligned(E_in, 32);
     rhou_in = (float *) __builtin_assume_aligned(rhou_in, 32);
     rhov_in = (float *) __builtin_assume_aligned(rhov_in, 32);

     f0rho_out  = (float *) __builtin_assume_aligned(f0rho_out, 32);
     f0E_out    = (float *) __builtin_assume_aligned(f0E_out, 32);
     f0rhou_out = (float *) __builtin_assume_aligned(f0rhou_out, 32);
     f0rhov_out = (float *) __builtin_assume_aligned(f0rhov_out, 32);

     f1rho_out  = (float *) __builtin_assume_aligned(f1rho_out, 32);
     f1E_out    = (float *) __builtin_assume_aligned(f1E_out, 32);
     f1rhou_out = (float *) __builtin_assume_aligned(f1rhou_out, 32);
     f1rhov_out = (float *) __builtin_assume_aligned(f1rhov_out, 32);

     #pragma omp parallel for
     for (int i = 0; i < n; ++i)
     {
         float rho = rho_in[i], E = E_in[i];
         float rhou = rhou_in[i], rhov = rhov_in[i];

         float invrho = 1.0f/rho;
         float u = invrho*rhou, v = invrho*rhov;

         float p = 0.4f*(E - 0.5f*(rhou*u + rhov*v));

         f0rho_out[i]  = rhou;       f1rho_out[i]  = rhov;
         f0rhou_out[i] = rhou*u + p; f1rhou_out[i] = rhov*u;
         f0rhov_out[i] = rhou*v;     f1rhov_out[i] = rhov*v + p;
         f0E_out[i]    = (E + p)*u;  f1E_out[i]    = (E + p)*v;
     }
}

The combination of "restrict" and the alignment fluff yields some
extremely tight ASM on my AVX-capable system.  However, when OpenMP
enters the mix the resulting code is not vectorized:

   gcc-4.7.2 -std=c99 -Ofast -fopenmp -march=native -S fn.c

as can be seen by a simple inspection of the resulting assembly.  I
believe this is due to Bug 46032 (although some of the comments imply
that it should be fixed).  It appears as if either the "restrict"
property or the alignment is getting clobbered when the OpenMP 'inner'
function is generated.

Can anyone suggest any workarounds?  It seems like a common problem and
I really do not want to reinvent the wheel if a simple refactoring of my
code can iron everything out.

Regards, Freddie.

It's a Frequently Encountered Problem.  What did
-ftree-vectorizer-verbose=3 produce?
Part of the problem is that the OpenMP chunks won't have the alignments
you set carefully for the start of the array, unless the loop count
happens to be a multiple of the number of threads times the unrolling
factor times the vector register width; the alignment of each chunk is
therefore unknown at compile time.
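
To make the chunk alignment point concrete, here is a small sketch of the arithmetic. It assumes an even "ceiling" split of the iterations across threads; the actual division used by a static schedule is up to the OpenMP runtime, so chunk_start is a hypothetical stand-in, not libgomp's rule. The helper names are made up for illustration.

```c
/* Hypothetical even split of n iterations over nthreads threads:
   thread t starts at t * ceil(n / nthreads), clamped to n.
   A real OpenMP runtime may divide the iterations differently. */
static int chunk_start(int n, int nthreads, int t)
{
    int chunk = (n + nthreads - 1) / nthreads;
    int start = t * chunk;
    return start < n ? start : n;
}

/* Given a float array whose base address is 32-byte aligned, is the
   element at index `start` also 32-byte aligned?  32 bytes = 8 floats. */
static int start_is_aligned32(int start)
{
    return (start * (int) sizeof(float)) % 32 == 0;
}
```

With n = 1000 and 3 threads, the second thread would start at element 334, i.e. byte offset 1336, which is not a multiple of 32. With n = 1024 and 4 threads every chunk starts on a 32-byte boundary, but the compiler cannot know that when n and the thread count are run-time values.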
It remains to be seen how much the OpenMP 4.0 proposals for pragmas to
deal with this will help.
Until then, OpenMP tends to work better with at least 2 levels of loops,
where the outer is parallelizable and the inner vectorizable.
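
A minimal sketch of that two-level structure, using a cut-down stand-in for the kernel (one output, u = rhou/rho) rather than the full flux function. BLOCK, velocity_block and velocity are hypothetical names; BLOCK is chosen as a multiple of 8 floats so that every block starts 32-byte aligned whenever the arrays do. The noinline attribute is an assumption on my part, meant to keep the inner loop out of the function OpenMP outlines. Whether the inner loop actually vectorizes with a given GCC should still be checked with the vectorizer report.

```c
/* Hypothetical block length in elements; a multiple of 8 floats (32 bytes). */
#define BLOCK 1024

/* Inner level: a plain loop with no OpenMP, so the restrict qualifiers
   and the alignment hints are attached directly to this function. */
__attribute__((noinline))
static void velocity_block(const float *restrict rho_in,
                           const float *restrict rhou_in,
                           float *restrict u_out, int n)
{
    rho_in  = (const float *) __builtin_assume_aligned(rho_in, 32);
    rhou_in = (const float *) __builtin_assume_aligned(rhou_in, 32);
    u_out   = (float *) __builtin_assume_aligned(u_out, 32);

    for (int i = 0; i < n; ++i)
        u_out[i] = rhou_in[i] / rho_in[i];
}

/* Outer level: OpenMP distributes whole blocks, so each call to the
   kernel starts at a multiple of BLOCK elements from the array base.
   The arrays themselves must be 32-byte aligned by the caller. */
void velocity(const float *rho_in, const float *rhou_in,
              float *u_out, int n)
{
    int nblocks = (n + BLOCK - 1) / BLOCK;

    #pragma omp parallel for
    for (int b = 0; b < nblocks; ++b)
    {
        int start = b * BLOCK;
        int len = (n - start < BLOCK) ? n - start : BLOCK;

        velocity_block(rho_in + start, rhou_in + start,
                       u_out + start, len);
    }
}
```

Only the last block can be shorter than BLOCK, and it still starts on an aligned boundary, so the alignment promise made to the compiler holds for every call.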
Tim

--
Tim Prince