[OpenACC] Performance issues on simple example program

Christopher Guckes <chris@xxxxxxxxxxxxxxxxxxx> · Tue, 21 Jun 2016 19:27:43 +0200

I can compile and run the PI example from
http://scelementary.com/2015/04/25/openacc-in-gcc.html with GCC 6.1.0,
but the GPU version of the code is much slower than the CPU version and
I can't figure out why. I didn't have this problem with GCC 5.3.0.

The code looks as follows:

#include <stdio.h>
#include <stdlib.h>

#define N 200000000

int main(void) {
  double pi = 0.0;
  long long i;

  #pragma acc data copyout(pi)
  {
    #pragma acc parallel loop reduction (+:pi) present (pi)
    for (i=0; i<N; i++) {
      double t= (double)((i+0.5)/N);
      pi +=4.0/(1.0+t*t);
    }
  }

  printf("pi=%11.10f\n",pi/N);

  return 0;
}

The GPU version takes about four times as long as the CPU version of the
code. I used the NVIDIA Visual Profiler to rule out a copy operation as
the cause: copying was measured at 0.1%, while the kernel itself runs
for about six seconds on a GTX 970. The profiler reports an occupancy of
1.6% and names the grid size as the limiting factor. I'm quite new to
GPU code, so I'm not sure what to do about that. The original sample
code used a vector length of 1024, whereas the default in GCC 6.1.0
seems to be 32. When I try to set the vector length to 1024 manually,
the compiler warns that it will ignore the value. What else can I try to
get this to run faster?

Thanks in advance
Chris
