Tips for improving OpenMP offload performance on NVIDIA GPU

Hello,

I am having trouble getting good performance with GCC’s OpenMP
offloading to NVPTX targets. I am testing with a Jacobi iteration, an
example commonly used to demonstrate offloading.

Computer:
* CPU: 2 x Intel Xeon Gold 5115 (20 cores, 40 threads total)
* System memory: 192 GB
* GPU: NVIDIA Geforce RTX 2080 (CUDA compute capability 7.5)
* OS: Fedora 30 (Linux kernel 5.3.11)

GCC version (trunk):
Using built-in specs.
COLLECT_GCC=gcc
COLLECT_LTO_WRAPPER=/home/eraut/software/gcc-trunk/gcc_nvptx_offload/libexec/gcc/x86_64-pc-linux-gnu/10.0.0/lto-wrapper
OFFLOAD_TARGET_NAMES=nvptx-none
Target: x86_64-pc-linux-gnu
Configured with: ../gcc_source/configure
--prefix=/home/eraut/software/gcc-trunk/gcc_nvptx_offload
--disable-multilib --disable-bootstrap --enable-checking=release
--enable-languages=c,c++,fortran,lto
--enable-offload-targets=nvptx-none
--with-cuda-driver-lib=/opt/cuda/10.1.168/lib64
--with-cuda-driver-include=/opt/cuda/10.1.168/include
Thread model: posix
Supported LTO compression algorithms: zlib
gcc version 10.0.0 20191117 (experimental) (GCC)

The code looks like this:

#pragma omp target data map(to:A[0:N]) map(alloc:Anew[0:N])
{
    const int iter_max = 10000;
    int iter = 0;
    while ( error > tol && iter < iter_max ) {
        error = 0.0;
        #pragma omp target map(error)
        #pragma omp teams distribute parallel for simd reduction(max:error) collapse(2)
        for (int i = 1; i < n-1; i++) {
            for (int j = 1; j < n-1; j++) {
                Anew[i*n+j] = 0.25 * ( A[i*n+j+1] + A[i*n+j-1] +
                                      A[(i-1)*n+j] + A[(i+1)*n+j] );
                error = fmax(error, fabs(Anew[i*n+j] - A[i*n+j]));
            }
        }
        // Update A
        #pragma omp target teams distribute parallel for simd collapse(2)
        for (int i = 1; i < n-1; i++) {
            for (int j = 1; j < n-1; j++) {
                A[i*n+j] = Anew[i*n+j];
            }
        }
        ++iter;
        if (iter % 100 == 0)
            printf("Iteration %d, error = %g\n", iter, error);
    }
}

With a 1000x1000 matrix and the code compiled with "-O3", this takes
about 5 seconds with Clang and about 24 seconds with GCC. For
reference, it takes only about 0.78 seconds with PGI+OpenACC using the
corresponding OpenACC directives.
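
For reference, the OpenACC version is structured roughly like this,
using the same variables as above (a sketch of the corresponding
directives; the exact clauses in my PGI build may differ slightly):

#pragma acc data copyin(A[0:N]) create(Anew[0:N])
{
    const int iter_max = 10000;
    int iter = 0;
    while ( error > tol && iter < iter_max ) {
        error = 0.0;
        #pragma acc parallel loop collapse(2) reduction(max:error)
        for (int i = 1; i < n-1; i++) {
            for (int j = 1; j < n-1; j++) {
                Anew[i*n+j] = 0.25 * ( A[i*n+j+1] + A[i*n+j-1] +
                                       A[(i-1)*n+j] + A[(i+1)*n+j] );
                error = fmax(error, fabs(Anew[i*n+j] - A[i*n+j]));
            }
        }
        // Update A
        #pragma acc parallel loop collapse(2)
        for (int i = 1; i < n-1; i++)
            for (int j = 1; j < n-1; j++)
                A[i*n+j] = Anew[i*n+j];
        ++iter;
    }
}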

Output from nvprof, as shown below, indicates that about half of the
time (in the "API calls" section) is being spent on "cuMemAlloc" and
"cuMemFree". This is a bit surprising since this kernel is structured
such that all the data is kept on the GPU and only the "error" scalar
is transferred to and from the device at each timestep. Investigation
with the visual profiler shows that cuMemAlloc and cuMemFree are being
called before and after (respectively) each omp target region, and
that these calls take longer than the kernel itself.
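
In case it clarifies what I mean by keeping everything resident: the
structure I had in mind maps the scalar once in the enclosing data
region and moves it with "target update" (sketch of the first kernel
only, same variables as above; I have not measured whether this
changes GCC's behaviour):

#pragma omp target data map(to:A[0:N]) map(alloc:Anew[0:N]) map(tofrom:error)
{
    const int iter_max = 10000;
    int iter = 0;
    while ( error > tol && iter < iter_max ) {
        error = 0.0;
        #pragma omp target update to(error)    // copy the reset value to the device
        #pragma omp target teams distribute parallel for simd reduction(max:error) collapse(2)
        for (int i = 1; i < n-1; i++)
            for (int j = 1; j < n-1; j++) {
                Anew[i*n+j] = 0.25 * ( A[i*n+j+1] + A[i*n+j-1] +
                                       A[(i-1)*n+j] + A[(i+1)*n+j] );
                error = fmax(error, fabs(Anew[i*n+j] - A[i*n+j]));
            }
        #pragma omp target update from(error)  // fetch the reduced value
        // ... second target region (copying Anew back to A) unchanged ...
        ++iter;
    }
}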

    ==5858== Profiling result:
                Type  Time(%)      Time     Calls       Avg       Min       Max  Name
     GPU activities:   77.87%  8.65573s      4467  1.9377ms  1.8907ms  2.4392ms  main$_omp_fn$0
                       22.00%  2.44538s      4467  547.43us  511.97us  669.15us  main$_omp_fn$2
                        0.08%  8.7558ms      8935     979ns     672ns  1.6478ms  [CUDA memcpy HtoD]
                        0.05%  5.3973ms      4467  1.2080us     960ns  13.344us  [CUDA memcpy DtoH]
          API calls:   48.06%  11.1161s      8934  1.2442ms  512.74us  2.4412ms  cuCtxSynchronize
                       28.45%  6.58100s     17869  368.29us  6.4930us  162.56ms  cuMemAlloc
                       21.10%  4.88023s     17869  273.11us  3.3220us  14.474ms  cuMemFree
                        0.87%  201.07ms         1  201.07ms  201.07ms  201.07ms  cuCtxCreate
                        0.48%  111.48ms      8934  12.478us  9.4160us  599.79us  cuLaunchKernel
    ...

Does anyone have any idea what the problem could be, or any
suggestions that might improve the performance here? I've tried moving
the "parallel for simd" to the inner loop and similar changes, but
none of these helped very much.
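
For reference, the inner-loop variant was along these lines for the
first kernel (same variables as above, with the reduction specified at
both levels):

#pragma omp target map(error)
#pragma omp teams distribute reduction(max:error)
for (int i = 1; i < n-1; i++) {
    #pragma omp parallel for simd reduction(max:error)
    for (int j = 1; j < n-1; j++) {
        Anew[i*n+j] = 0.25 * ( A[i*n+j+1] + A[i*n+j-1] +
                               A[(i-1)*n+j] + A[(i+1)*n+j] );
        error = fmax(error, fabs(Anew[i*n+j] - A[i*n+j]));
    }
}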

Thanks very much!
Eric



