OpenMP nvptx and amdgcn offloading - questions on speed, grid size and printf

Dear list,

Happy new year!

I spent the past few days trying to port a previously developed CUDA photon simulator (https://github.com/fangq/mcx) to OpenMP 5.x.

After receiving some help from Mat Colgrove on the NVIDIA forum (https://forums.developer.nvidia.com/t/how-to-use-openmp-map-directive-to-map-dynamic-array-inside-a-struct-class-to-the-gpu/318270), I was able to compile the code successfully with both g++-12 (nvptx-none offloading) and NVIDIA's nvc++. I would like to get some help from this list because I noticed a dramatic (50x) speed difference between the g++ offloaded binary and the nvc++ compiled version.

The working code (~400 lines) can be accessed at https://github.com/fangq/umcx/blob/main/src/umcx.cpp; the core of the code is the omp target parallel-for section shown below:


|#ifdef _LIBGOMP_OMP_LOCK_DEFINED
    const int gridsize = 200000 / 64, blocksize = 2;  // gcc nvptx offloading uses {32,teams_thread_limit,1} as blockdim
#else
    const int gridsize = 200000 / 64, blocksize = 64; // nvc uses {num_teams,1,1} as griddim and {teams_thread_limit,1,1} as blockdim
#endif
    #pragma omp target teams distribute parallel for num_teams(gridsize) thread_limit(blocksize) \
        map(to: inputvol) map(to: inputvol.vol[0:inputvol.dimxyzt]) map(tofrom: outputvol) \
        map(tofrom: outputvol.vol[0:outputvol.dimxyzt]) map(to: pos) map(to: dir) map(to: seeds) \
        map(to: gcfg) map(to: prop[0:gcfg.mediumnum]) reduction(+ : energyescape) firstprivate(ran, p)
    for (uint64_t i = 0; i < nphoton; i++) {
        ran.reseed(seeds.x ^ i, seeds.y | i, seeds.z ^ i, seeds.w | i);
        p.launch(pos, dir);
        p.run(inputvol, outputvol, prop, ran, gcfg);
        energyescape += p.pos.w;
    }
|


So far, I have only tested this on Linux (Ubuntu 22.04, with g++-11/12/13). Although the |g++ -foffload=nvptx-none| built binary runs successfully and produces correct results on two of my Linux boxes, on both of them it is about *50x slower* than the nvc++ compiled binary on the same GPU; the nvc++ binary is itself about 2x slower than my original CUDA-based simulator.

Using nsight-compute (ncu), I noticed two major differences between the g++/nvptx offloaded simulation and the nvc++ offloaded version.

For the nvc++ compiled binary, the thread configuration is shown in the table below:

|*   nvkernel_main_F1L451_2 (3125, 1, 1)x(64, 1, 1), Context 1, Stream 13, Device 0, CC 7.5*
    Section: Launch Statistics
    -------------------------------- --------------- ---------------
    Metric Name                          Metric Unit    Metric Value
    -------------------------------- --------------- ---------------
    Block Size                                                    64
    Function Cache Configuration                     CachePreferNone
*   Grid Size                                                  3,125*
*   Registers Per Thread             register/thread              66*
    Shared Memory Configuration Size           Kbyte           32.77
    Driver Shared Memory Per Block        byte/block               0
    Dynamic Shared Memory Per Block       byte/block               0
    Static Shared Memory Per Block        byte/block               8
    # SMs                                         SM              48
    Threads                                   thread         200,000
    Uses Green Context                                             0
    Waves Per SM                                                4.65
    -------------------------------- --------------- ---------------

    Section: Occupancy
    ------------------------------- ----------- ------------
    Metric Name                     Metric Unit Metric Value
    ------------------------------- ----------- ------------
    Block Limit SM                        block           16
    Block Limit Registers                 block           14
    Block Limit Shared Mem                block          128
    Block Limit Warps                     block           16
    Theoretical Active Warps per SM        warp           28
    Theoretical Occupancy                     %        87.50
*   Achieved Occupancy                        %        78.43*
    Achieved Active Warps Per SM           warp        25.10
    ------------------------------- ----------- ------------|

In comparison, the g++-12 compiled binary reports the stats below:

|*   main$_omp_fn$0 (288, 1, 1)x(32, 2, 1), Context 1, Stream 7, Device 0, CC 7.5*
    Section: Launch Statistics
    -------------------------------- --------------- ---------------
    Metric Name                          Metric Unit    Metric Value
    -------------------------------- --------------- ---------------
    Block Size                                                    64
    Function Cache Configuration                     CachePreferNone
*   Grid Size                                                    288*
*   Registers Per Thread             register/thread             151*
    Shared Memory Configuration Size           Kbyte           32.77
    Driver Shared Memory Per Block        byte/block               0
    Dynamic Shared Memory Per Block       byte/block               0
    Static Shared Memory Per Block        byte/block             396
    # SMs                                         SM              48
    Threads                                   thread          18,432
    Uses Green Context                                             0
    Waves Per SM                                                   1
    -------------------------------- --------------- ---------------

    Section: Occupancy
    ------------------------------- ----------- ------------
    Metric Name                     Metric Unit Metric Value
    ------------------------------- ----------- ------------
    Block Limit SM                        block           16
    Block Limit Registers                 block            6
    Block Limit Shared Mem                block           64
    Block Limit Warps                     block           16
    Theoretical Active Warps per SM        warp           12
    Theoretical Occupancy                     %        37.50
*   Achieved Occupancy                        %        31.01*
    Achieved Active Warps Per SM           warp         9.92
    ------------------------------- ----------- ------------|

One of the major differences is that the g++ compiled binary ignored my |num_teams(gridsize)| setting and launched about 11x fewer teams/blocks (288) than the requested grid size (200000/64 = 3125). For my particular workload, I found in the past that at least tens of thousands of threads are needed to maximize occupancy.

*So my first question* is: how do I force g++ to use my desired number of teams? I tried both |num_teams| and setting |OMP_NUM_TEAMS|, but g++ somehow decided to launch 288 teams.
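
In case it helps with debugging, below is a minimal standalone probe (my own sketch, not part of umcx) that reads back how many teams the runtime actually grants, using only standard OpenMP runtime calls:

|// probe_teams.cpp - check how many teams are actually launched; built with:
//   g++-12 -O2 -fopenmp -foffload=nvptx-none probe_teams.cpp
#include <cstdio>
#include <omp.h>

int main() {
    int nteams = 0;
    // request the same grid as umcx, then read back what the runtime grants
    #pragma omp target teams num_teams(3125) thread_limit(64) map(tofrom: nteams)
    {
        if (omp_get_team_num() == 0) {
            nteams = omp_get_num_teams();
        }
    }
    printf("runtime launched %d teams\n", nteams);
    return 0;
}|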


The second major difference is that the g++ compiled binary uses 151 registers per thread, compared to 66 for nvc++. The higher register count limits the number of blocks that can run concurrently on an SM.

*My second question* is: is there a flag to reduce this register use? If you manually count the variables used in the two classes that perform the simulation, |MCX_rand ran| and |MCX_photon p|, they should not cost that many registers.
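
One restructuring I have been considering (just a sketch; it assumes |MCX_rand| and |MCX_photon| gain constructors mirroring |reseed()| and |launch()|, and I have not measured whether it changes the register count) is to make the per-photon state loop-local instead of |firstprivate|, so its live range is confined to a single iteration:

|    #pragma omp target teams distribute parallel for num_teams(gridsize) thread_limit(blocksize) \
        map(to: inputvol) map(to: inputvol.vol[0:inputvol.dimxyzt]) map(tofrom: outputvol) \
        map(tofrom: outputvol.vol[0:outputvol.dimxyzt]) map(to: pos) map(to: dir) map(to: seeds) \
        map(to: gcfg) map(to: prop[0:gcfg.mediumnum]) reduction(+ : energyescape)
    for (uint64_t i = 0; i < nphoton; i++) {
        // hypothetical constructors standing in for reseed()/launch()
        MCX_rand ran(seeds.x ^ i, seeds.y | i, seeds.z ^ i, seeds.w | i);
        MCX_photon p(pos, dir);
        p.run(inputvol, outputvol, prop, ran, gcfg);
        energyescape += p.pos.w;
    }|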

On a side note, g++-13 can compile the code, but the resulting binary fails at runtime with the error below, so I cannot tell whether it behaves differently:

   |libgomp: cuMemGetAddressRange_v2 error: named symbol not found
   libgomp: Copying of host object [0x7ffee93eeb18..0x7ffee93eeb20) to
   dev object [0x7fb1e84b5010..0x7fb1e84b5018) failed|


*My third question* is about using printf() inside the omp target section. The documentation (https://gcc.gnu.org/onlinedocs/libgomp/nvptx.html) says that printf() should work, but when I build the binary, I get an "unresolved symbol __printf_chk" error. Is there a flag to enable printf()?
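
My own speculation (not verified) is that Ubuntu's gcc packages define _FORTIFY_SOURCE by default when optimizing, which makes glibc's headers rewrite printf() into __printf_chk(), a symbol the nvptx-side C library does not provide. A minimal repro of what I am trying to do:

|// printf_test.cpp - minimal device-side printf test; built with:
//   g++-12 -O2 -fopenmp -foffload=nvptx-none printf_test.cpp
// (adding -U_FORTIFY_SOURCE is my untested guess at a workaround)
#include <cstdio>

int main() {
    #pragma omp target
    {
        printf("hello from the device\n");
    }
    return 0;
}|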


Finally, for amdgcn, I was able to compile the binary using both g++-12 and g++-13; however, running it fails with errors in both cases.

For the g++-12 produced amdgcn binary, the error is

   |libgomp: Offload data incompatible with GCN plugin (expected 3,
   received 2)
   libgomp: Cannot map target functions or variables (expected 1, have
   4294967295)|

For the g++-13 produced amdgcn binary, the error is

   |Memory access fault by GPU node-1 (Agent handle: 0x629916042c00) on
   address 0x7ae82cef1000. Reason: Page not present or supervisor
   privilege.
   Aborted (core dumped)|

The testing machine has three AMD GPUs: a 7900 GRE, a 5600 XT, and a Radeon VII (|gfx906|), running Ubuntu 22.04 and ROCm 6.
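
Since only the Radeon VII matches the gfx906 offload target, one thing I still need to rule out is whether libgomp is picking up one of the other two GPUs. A small probe like the sketch below (my own, not part of umcx) should at least show how many offload devices libgomp sees and which one is the default:

|// probe_devices.cpp - list the offload devices libgomp sees; built with:
//   g++-12 -O2 -fopenmp -foffload=amdgcn-amdhsa=-march=gfx906 probe_devices.cpp
#include <cstdio>
#include <omp.h>

int main() {
    printf("offload devices visible: %d\n", omp_get_num_devices());
    printf("default device: %d\n", omp_get_default_device());
    return 0;
}|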


If you want to reproduce these issues, here are the commands:

   |git clone https://github.com/fangq/umcx.git
   cd umcx/src
   make nvidia CXX=g++-12   # nvptx offloading, need nvidia gpu
   ./umcx cube60    # or ncu ./umcx cube60 to call nsight-compute
   |

   |make nvc    # use nvc++ to build the binary, need nvidia gpu
   ./umcx cube60  # or ncu ./umcx cube60|

   |make amd CXX=g++-12    # amdgcn offloading, need gfx906 gpu
   ./umcx cube60
   |


Any suggestions on how to get around these issues would be greatly appreciated!

thanks


Qianqian




