I can compile and run the PI example from http://scelementary.com/2015/04/25/openacc-in-gcc.html with the current GCC 6.1.0. However, the GPU version of the code is much slower than the CPU version, and I can't figure out why. I didn't have this problem with GCC 5.3.0.

The code looks as follows:

    #include <stdio.h>
    #include <stdlib.h>

    #define N 200000000

    int main(void)
    {
        double pi = 0.0f;
        long long i;

        #pragma acc data copyout(pi)
        {
            #pragma acc parallel loop reduction (+:pi) present (pi)
            for (i=0; i<N; i++) {
                double t= (double)((i+0.5)/N);
                pi +=4.0/(1.0+t*t);
            }
        }

        printf("pi=%11.10f\n",pi/N);
        return 0;
    }

The GPU version takes about four times as long as the CPU version. I used the NVIDIA Visual Profiler to make sure a copy operation wasn't what tanked the runtime: copying accounts for 0.1% of the time, while the kernel itself runs for about six seconds on a GTX 970. The profiler reports an occupancy of 1.6% and names the grid size as the limiting factor. I'm quite new to GPU code, so I'm not sure what to do about that.

The original sample code used a vector length of 1024; the default in GCC 6.1.0 seems to be 32. When I try to set the vector length to 1024 manually, the compiler warns me that it will ignore the setting.

What else can I try to get this to run faster?

Thanks in advance,
Chris