Hello,

I am having trouble getting good performance with GCC's OpenMP offloading to NVPTX targets. I am using the Jacobi iteration, an example that is often used for OpenMP offloading.

Computer:
 * CPU: 2 x Intel Xeon Gold 5115 (20 cores, 40 threads total)
 * System memory: 192 GB
 * GPU: NVIDIA GeForce RTX 2080 (CUDA compute capability 7.5)
 * OS: Fedora 30 (Linux kernel 5.3.11)

GCC version (trunk):

Using built-in specs.
COLLECT_GCC=gcc
COLLECT_LTO_WRAPPER=/home/eraut/software/gcc-trunk/gcc_nvptx_offload/libexec/gcc/x86_64-pc-linux-gnu/10.0.0/lto-wrapper
OFFLOAD_TARGET_NAMES=nvptx-none
Target: x86_64-pc-linux-gnu
Configured with: ../gcc_source/configure --prefix=/home/eraut/software/gcc-trunk/gcc_nvptx_offload --disable-multilib --disable-bootstrap --enable-checking=release --enable-languages=c,c++,fortran,lto --enable-offload-targets=nvptx-none --with-cuda-driver-lib=/opt/cuda/10.1.168/lib64 --with-cuda-driver-include=/opt/cuda/10.1.168/include
Thread model: posix
Supported LTO compression algorithms: zlib
gcc version 10.0.0 20191117 (experimental) (GCC)

The code looks like this:

    #pragma omp target data map(to:A[0:N]) map(alloc:Anew[0:N])
    {
        const int iter_max = 10000;
        int iter = 0;

        while ( error > tol && iter < iter_max )
        {
            error = 0.0;

            #pragma omp target map(error)
            #pragma omp teams distribute parallel for simd reduction(max:error) collapse(2)
            for (int i = 1; i < n-1; i++) {
                for (int j = 1; j < n-1; j++) {
                    Anew[i*n+j] = 0.25 * ( A[i*n+j+1] + A[i*n+j-1]
                                         + A[(i-1)*n+j] + A[(i+1)*n+j] );
                    error = fmax(error, fabs(Anew[i*n+j] - A[i*n+j]));
                }
            }

            // Update A
            #pragma omp target teams distribute parallel for simd collapse(2)
            for (int i = 1; i < n-1; i++) {
                for (int j = 1; j < n-1; j++) {
                    A[i*n+j] = Anew[i*n+j];
                }
            }

            ++iter;
            if (iter % 100 == 0)
                printf("Iteration %d, error = %g\n", iter, error);
        }
    }

With a matrix size of 1000x1000 and the code compiled with "-O3", this takes about 5 seconds using Clang and about 24 seconds using GCC. For reference, it takes only about 0.78 seconds using PGI+OpenACC with the corresponding OpenACC directives (sketched at the end of this message).

Output from nvprof, shown below, indicates that about half of the time (in the "API calls" section) is being spent in "cuMemAlloc" and "cuMemFree". This is surprising, since the code is structured so that all of the data stays on the GPU and only the "error" scalar is transferred to and from the device at each iteration. Investigation with the visual profiler shows that cuMemAlloc and cuMemFree are called before and after (respectively) each omp target region, and that these calls take longer than the kernel itself.

==5858== Profiling result:
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:   77.87%  8.65573s      4467  1.9377ms  1.8907ms  2.4392ms  main$_omp_fn$0
                   22.00%  2.44538s      4467  547.43us  511.97us  669.15us  main$_omp_fn$2
                    0.08%  8.7558ms      8935     979ns     672ns  1.6478ms  [CUDA memcpy HtoD]
                    0.05%  5.3973ms      4467  1.2080us     960ns  13.344us  [CUDA memcpy DtoH]
      API calls:   48.06%  11.1161s      8934  1.2442ms  512.74us  2.4412ms  cuCtxSynchronize
                   28.45%  6.58100s     17869  368.29us  6.4930us  162.56ms  cuMemAlloc
                   21.10%  4.88023s     17869  273.11us  3.3220us  14.474ms  cuMemFree
                    0.87%  201.07ms         1  201.07ms  201.07ms  201.07ms  cuCtxCreate
                    0.48%  111.48ms      8934  12.478us  9.4160us  599.79us  cuLaunchKernel
                    ...

Does anyone have any idea what the problem could be, or any suggestions that might improve the performance here? I've tried moving the "parallel for simd" to the inner for loop and similar changes, but none of these seem to help very much (a sketch of one such variant is at the end of this message).

Thanks very much!
Eric
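
P.S. In case the exact build matters: the compile lines were essentially the following (reproduced from memory, so treat the exact flags and file names as approximate):

    gcc   -O3 -fopenmp -foffload=nvptx-none                    jacobi.c -o jacobi_gcc
    clang -O3 -fopenmp -fopenmp-targets=nvptx64-nvidia-cuda    jacobi.c -o jacobi_clang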
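
The OpenACC version I compared against uses the corresponding directives, roughly like this (a sketch only; the data and loop clauses in my actual code may differ slightly):

    #pragma acc data copyin(A[0:N]) create(Anew[0:N])
    {
        while ( error > tol && iter < iter_max )
        {
            error = 0.0;

            /* Jacobi sweep with a max reduction on the error */
            #pragma acc parallel loop collapse(2) reduction(max:error)
            for (int i = 1; i < n-1; i++) {
                for (int j = 1; j < n-1; j++) {
                    Anew[i*n+j] = 0.25 * ( A[i*n+j+1] + A[i*n+j-1]
                                         + A[(i-1)*n+j] + A[(i+1)*n+j] );
                    error = fmax(error, fabs(Anew[i*n+j] - A[i*n+j]));
                }
            }

            /* Copy Anew back into A on the device */
            #pragma acc parallel loop collapse(2)
            for (int i = 1; i < n-1; i++) {
                for (int j = 1; j < n-1; j++) {
                    A[i*n+j] = Anew[i*n+j];
                }
            }

            ++iter;
        }
    }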
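
And one of the variants I tried, with the worksharing moved to the inner loop, looked roughly like this (first target region only, as a sketch):

    #pragma omp target map(error)
    #pragma omp teams distribute reduction(max:error)
    for (int i = 1; i < n-1; i++) {
        /* Each team handles one row; threads share the inner loop */
        #pragma omp parallel for simd reduction(max:error)
        for (int j = 1; j < n-1; j++) {
            Anew[i*n+j] = 0.25 * ( A[i*n+j+1] + A[i*n+j-1]
                                 + A[(i-1)*n+j] + A[(i+1)*n+j] );
            error = fmax(error, fabs(Anew[i*n+j] - A[i*n+j]));
        }
    }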