Re: Loop Vectorization and OpenMP

On 1/14/2013 9:33 AM, Freddie Witherden wrote:
Hi all,

I have a function which I wish to accelerate with auto-vectorization and
OpenMP:

void fn(float *restrict rho_in,     float *restrict E_in,
         float *restrict rhou_in,    float *restrict rhov_in,
         float *restrict f0rho_out,  float *restrict f0E_out,
         float *restrict f0rhou_out, float *restrict f0rhov_out,
         float *restrict f1rho_out,  float *restrict f1E_out,
         float *restrict f1rhou_out, float *restrict f1rhov_out,
         int n)
{
     rho_in  = (float *) __builtin_assume_aligned(rho_in, 32);
     E_in    = (float *) __builtin_assume_aligned(E_in, 32);
     rhou_in = (float *) __builtin_assume_aligned(rhou_in, 32);
     rhov_in = (float *) __builtin_assume_aligned(rhov_in, 32);

     f0rho_out  = (float *) __builtin_assume_aligned(f0rho_out, 32);
     f0E_out    = (float *) __builtin_assume_aligned(f0E_out, 32);
     f0rhou_out = (float *) __builtin_assume_aligned(f0rhou_out, 32);
     f0rhov_out = (float *) __builtin_assume_aligned(f0rhov_out, 32);

     f1rho_out  = (float *) __builtin_assume_aligned(f1rho_out, 32);
     f1E_out    = (float *) __builtin_assume_aligned(f1E_out, 32);
     f1rhou_out = (float *) __builtin_assume_aligned(f1rhou_out, 32);
     f1rhov_out = (float *) __builtin_assume_aligned(f1rhov_out, 32);

     #pragma omp parallel for
     for (int i = 0; i < n; ++i)
     {
         float rho = rho_in[i], E = E_in[i];
         float rhou = rhou_in[i], rhov = rhov_in[i];

         float invrho = 1.0f/rho;
         float u = invrho*rhou, v = invrho*rhov;

         float p = 0.4f*(E - 0.5f*(rhou*u + rhov*v));

         f0rho_out[i]  = rhou;       f1rho_out[i]  = rhov;
         f0rhou_out[i] = rhou*u + p; f1rhou_out[i] = rhov*u;
         f0rhov_out[i] = rhou*v;     f1rhov_out[i] = rhov*v + p;
         f0E_out[i]    = (E + p)*u;  f1E_out[i]    = (E + p)*v;
     }
}

The combination of "restrict" along with the alignment fluff yields some
extremely tight assembly on my AVX-capable system.  However, when OpenMP
enters the mix the resulting code is not vectorized:

   gcc-4.7.2 -std=c99 -Ofast -fopenmp -march=native -S fn.c

as can be seen by a simple inspection of the resulting assembly.  I
believe this is due to Bug 46032 (although some of the comments imply
that it should be fixed).  It appears as if either the "restrict"
property or the alignment is getting clobbered when the OpenMP 'inner'
function is generated.

Can anyone suggest any workarounds?  It seems like a common problem and
I really do not want to reinvent the wheel if a simple refactoring of my
code can iron everything out.

Regards, Freddie.

It's a Frequently Encountered Problem. What did -ftree-vectorizer-verbose=3 produce?

Part of the problem is that the OpenMP chunks won't have the alignments you set so carefully for the start of each array, unless the loop count happens to be a multiple of (number of threads) x (unrolling factor) x (vector register width). That condition can't be known at compile time.

It remains to be seen how much the OpenMP 4.0 proposals for pragmas to deal with this may help. Until then, OpenMP tends to work better with at least 2 levels of loops, where the outer is parallelizable and the inner vectorizable.

Tim

--
Tim Prince
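
A minimal sketch of the two-level structure suggested above, reduced to the pressure computation only. The names pressure_block, pressure and BLOCK are hypothetical and do not come from the original post. BLOCK = 1024 is an arbitrary choice; the only property relied on is that it is a multiple of 8 floats (32 bytes), so every block starts on a 32-byte boundary whenever the arrays themselves do. The noinline attribute is an assumption, intended to keep the kernel as a separate function so that its "restrict" and alignment information is not lost in the OpenMP outlined function. Whether a given GCC release then vectorizes the inner loop should be checked with -ftree-vectorizer-verbose rather than taken for granted.

```c
#include <assert.h>

/* Hypothetical block length: a multiple of 8 floats (32 bytes), so each
   block begins 32-byte aligned if the arrays are 32-byte aligned. */
enum { BLOCK = 1024 };

/* Inner level: a plain loop with no OpenMP pragma, meant for the vectorizer.
   noinline (an assumption) keeps it out of the OpenMP outlined function. */
static __attribute__((noinline)) void
pressure_block(const float *restrict rho_in,  const float *restrict rhou_in,
               const float *restrict rhov_in, const float *restrict E_in,
               float *restrict p_out, int len)
{
    rho_in  = (const float *) __builtin_assume_aligned(rho_in, 32);
    rhou_in = (const float *) __builtin_assume_aligned(rhou_in, 32);
    rhov_in = (const float *) __builtin_assume_aligned(rhov_in, 32);
    E_in    = (const float *) __builtin_assume_aligned(E_in, 32);
    p_out   = (float *) __builtin_assume_aligned(p_out, 32);

    for (int i = 0; i < len; ++i)
    {
        float invrho = 1.0f/rho_in[i];
        float u = invrho*rhou_in[i], v = invrho*rhov_in[i];

        p_out[i] = 0.4f*(E_in[i] - 0.5f*(rhou_in[i]*u + rhov_in[i]*v));
    }
}

/* Outer level: OpenMP distributes whole blocks, never partial ones, so a
   thread always starts its work at a multiple of BLOCK elements. */
void pressure(const float *rho_in,  const float *rhou_in,
              const float *rhov_in, const float *E_in,
              float *p_out, int n)
{
    assert(n >= 0);
    int nblocks = (n + BLOCK - 1)/BLOCK;

    #pragma omp parallel for
    for (int b = 0; b < nblocks; ++b)
    {
        int start = b*BLOCK;
        int len = (n - start < BLOCK) ? n - start : BLOCK; /* short last block */

        pressure_block(rho_in + start, rhou_in + start, rhov_in + start,
                       E_in + start, p_out + start, len);
    }
}
```

The arrays passed to pressure() must themselves be 32-byte aligned, as in the original function. Only the last block can be shorter than BLOCK, and it still starts on an aligned address.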


