OpenMP nvptx and amdgcn offloading - questions on speed, grid size and printf

Dear list,

Happy new year!

I spent the past few days trying to port a previously developed CUDA photon simulator (https://github.com/fangq/mcx) to OpenMP 5.x.

After receiving some help from Mat Colgrove on the NVIDIA forum (https://forums.developer.nvidia.com/t/how-to-use-openmp-map-directive-to-map-dynamic-array-inside-a-struct-class-to-the-gpu/318270), I was able to compile the code successfully with both g++-12 (nvptx-none offloading) and NVIDIA's nvc++. I would like to get some help from this list because I noticed a dramatic (50x) speed difference between the g++ offloaded binary and the nvc++ compiled version.

The working code (~400 lines) can be accessed at https://github.com/fangq/umcx/blob/main/src/umcx.cpp; the core of the code is the omp target parallel-for section shown below:


|#ifdef _LIBGOMP_OMP_LOCK_DEFINED
    const int gridsize = 200000 / 64, blocksize = 2;  // gcc nvptx offloading uses {32,teams_thread_limit,1} as blockdim
#else
    const int gridsize = 200000 / 64, blocksize = 64; // nvc uses {num_teams,1,1} as griddim and {teams_thread_limit,1,1} as blockdim
#endif
    #pragma omp target teams distribute parallel for num_teams(gridsize) thread_limit(blocksize) \
        map(to: inputvol) map(to: inputvol.vol[0:inputvol.dimxyzt]) map(tofrom: outputvol) \
        map(tofrom: outputvol.vol[0:outputvol.dimxyzt]) map(to: pos) map(to: dir) map(to: seeds) \
        map(to: gcfg) map(to: prop[0:gcfg.mediumnum]) reduction(+ : energyescape) firstprivate(ran, p)
    for (uint64_t i = 0; i < nphoton; i++) {
        ran.reseed(seeds.x ^ i, seeds.y | i, seeds.z ^ i, seeds.w | i);
        p.launch(pos, dir);
        p.run(inputvol, outputvol, prop, ran, gcfg);
        energyescape += p.pos.w;
    }
|


So far, I have only tested this on Linux (Ubuntu 22.04, with g++-11/12/13). Although the |g++ -foffload=nvptx-none| built binary runs successfully and produces correct results on two of my Linux boxes, on both of them it is about *50x slower* than the nvc++ compiled binary on the same GPU; the nvc++ binary is itself about 2x slower than my original CUDA-based simulator.

Using nsight-compute (ncu), I noticed two major differences between the g++/nvptx offloaded simulation and the nvc++ offloaded version.

For the nvc++ compiled binary, the thread configuration is shown in the table below:

|*   nvkernel_main_F1L451_2 (3125, 1, 1)x(64, 1, 1), Context 1, Stream 13, Device 0, CC 7.5*
    Section: Launch Statistics
    -------------------------------- --------------- ---------------
    Metric Name                          Metric Unit    Metric Value
    -------------------------------- --------------- ---------------
    Block Size                                                    64
    Function Cache Configuration                     CachePreferNone
*   Grid Size                                                  3,125*
*   Registers Per Thread             register/thread              66*
    Shared Memory Configuration Size           Kbyte           32.77
    Driver Shared Memory Per Block        byte/block               0
    Dynamic Shared Memory Per Block       byte/block               0
    Static Shared Memory Per Block        byte/block               8
    # SMs                                         SM              48
    Threads                                   thread         200,000
    Uses Green Context                                             0
    Waves Per SM                                                4.65
    -------------------------------- --------------- ---------------

    Section: Occupancy
    ------------------------------- ----------- ------------
    Metric Name                     Metric Unit Metric Value
    ------------------------------- ----------- ------------
    Block Limit SM                        block           16
    Block Limit Registers                 block           14
    Block Limit Shared Mem                block          128
    Block Limit Warps                     block           16
    Theoretical Active Warps per SM        warp           28
    Theoretical Occupancy                     %        87.50
*   Achieved Occupancy                        %        78.43*
    Achieved Active Warps Per SM           warp        25.10
    ------------------------------- ----------- ------------|

In comparison, the g++-12 compiled binary reports the stats below:

|*   main$_omp_fn$0 (288, 1, 1)x(32, 2, 1), Context 1, Stream 7, Device 0, CC 7.5*
    Section: Launch Statistics
    -------------------------------- --------------- ---------------
    Metric Name                          Metric Unit    Metric Value
    -------------------------------- --------------- ---------------
    Block Size                                                    64
    Function Cache Configuration                     CachePreferNone
*   Grid Size                                                    288*
*   Registers Per Thread             register/thread             151*
    Shared Memory Configuration Size           Kbyte           32.77
    Driver Shared Memory Per Block        byte/block               0
    Dynamic Shared Memory Per Block       byte/block               0
    Static Shared Memory Per Block        byte/block             396
    # SMs                                         SM              48
    Threads                                   thread          18,432
    Uses Green Context                                             0
    Waves Per SM                                                   1
    -------------------------------- --------------- ---------------

    Section: Occupancy
    ------------------------------- ----------- ------------
    Metric Name                     Metric Unit Metric Value
    ------------------------------- ----------- ------------
    Block Limit SM                        block           16
    Block Limit Registers                 block            6
    Block Limit Shared Mem                block           64
    Block Limit Warps                     block           16
    Theoretical Active Warps per SM        warp           12
    Theoretical Occupancy                     %        37.50
*   Achieved Occupancy                        %        31.01*
    Achieved Active Warps Per SM           warp         9.92
    ------------------------------- ----------- ------------|

One of the major differences is that the g++ compiled binary ignored my |num_teams(gridsize)| setting and launched about 11x fewer teams/blocks (288) than the requested grid size (200000/64 = 3125). For my particular workload, I found in the past that at least tens of thousands of threads are needed to maximize occupancy.

*So my first question* is: how do I force g++ to use my desired number of teams? I tried both |num_teams| and setting |OMP_NUM_TEAMS|, but g++ somehow decided to launch 288 teams.
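
In case it helps with debugging, below is a minimal standalone probe (my own sketch, not part of umcx) that reads back how many teams the runtime actually grants, using only standard OpenMP runtime calls:

|// probe_teams.cpp - check how many teams are actually launched; built with:
//   g++-12 -O2 -fopenmp -foffload=nvptx-none probe_teams.cpp
#include <cstdio>
#include <omp.h>

int main() {
    int nteams = 0;
    // request the same grid as umcx, then read back what the runtime grants
    #pragma omp target teams num_teams(3125) thread_limit(64) map(tofrom: nteams)
    {
        if (omp_get_team_num() == 0) {
            nteams = omp_get_num_teams();
        }
    }
    printf("runtime launched %d teams\n", nteams);
    return 0;
}|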


The second major difference is that the g++ compiled binary uses 151 registers per thread, compared to 66 for nvc++. The higher register count limits the number of blocks that can run concurrently on an SM.

*My second question* is: is there a flag to reduce this register use? If you manually count the variables used in the two classes that perform the simulation, |MCX_rand ran| and |MCX_photon p|, they should not cost that many registers.
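
One restructuring I have been considering (just a sketch; it assumes |MCX_rand| and |MCX_photon| gain constructors mirroring |reseed()| and |launch()|, and I have not measured whether it changes the register count) is to make the per-photon state loop-local instead of |firstprivate|, so its live range is confined to a single iteration:

|    #pragma omp target teams distribute parallel for num_teams(gridsize) thread_limit(blocksize) \
        map(to: inputvol) map(to: inputvol.vol[0:inputvol.dimxyzt]) map(tofrom: outputvol) \
        map(tofrom: outputvol.vol[0:outputvol.dimxyzt]) map(to: pos) map(to: dir) map(to: seeds) \
        map(to: gcfg) map(to: prop[0:gcfg.mediumnum]) reduction(+ : energyescape)
    for (uint64_t i = 0; i < nphoton; i++) {
        // hypothetical constructors standing in for reseed()/launch()
        MCX_rand ran(seeds.x ^ i, seeds.y | i, seeds.z ^ i, seeds.w | i);
        MCX_photon p(pos, dir);
        p.run(inputvol, outputvol, prop, ran, gcfg);
        energyescape += p.pos.w;
    }|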

On a side note, g++-13 can compile the code, but the resulting binary fails at runtime with the error below, so I cannot tell whether it behaves differently:

   |libgomp: cuMemGetAddressRange_v2 error: named symbol not found
   libgomp: Copying of host object [0x7ffee93eeb18..0x7ffee93eeb20) to
   dev object [0x7fb1e84b5010..0x7fb1e84b5018) failed|


*My third question* is about using printf() inside the omp target section. The documentation (https://gcc.gnu.org/onlinedocs/libgomp/nvptx.html) says that printf() should work, but when I build the binary, I get an "unresolved symbol __printf_chk" error. Is there a flag to enable printf()?
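
My own speculation (not verified) is that Ubuntu's gcc packages define _FORTIFY_SOURCE by default when optimizing, which makes glibc's headers rewrite printf() into __printf_chk(), a symbol the nvptx-side C library does not provide. A minimal repro of what I am trying to do:

|// printf_test.cpp - minimal device-side printf test; built with:
//   g++-12 -O2 -fopenmp -foffload=nvptx-none printf_test.cpp
// (adding -U_FORTIFY_SOURCE is my untested guess at a workaround)
#include <cstdio>

int main() {
    #pragma omp target
    {
        printf("hello from the device\n");
    }
    return 0;
}|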


Finally, for amdgcn, I was able to compile the binary using both g++-12 and g++-13; however, running it fails with errors in both cases.

For the g++-12 produced amdgcn binary, the error is

   |libgomp: Offload data incompatible with GCN plugin (expected 3,
   received 2)
   libgomp: Cannot map target functions or variables (expected 1, have
   4294967295)|

For the g++-13 produced amdgcn binary, the error is

   |Memory access fault by GPU node-1 (Agent handle: 0x629916042c00) on
   address 0x7ae82cef1000. Reason: Page not present or supervisor
   privilege.
   Aborted (core dumped)|

The testing machine has three AMD GPUs: a 7900 GRE, a 5600 XT, and a Radeon VII (|gfx906|), running Ubuntu 22.04 and ROCm 6.
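
Since only the Radeon VII matches the gfx906 offload target, one thing I still need to rule out is whether libgomp is picking up one of the other two GPUs. A small probe like the sketch below (my own, not part of umcx) should at least show how many offload devices libgomp sees and which one is the default:

|// probe_devices.cpp - list the offload devices libgomp sees; built with:
//   g++-12 -O2 -fopenmp -foffload=amdgcn-amdhsa=-march=gfx906 probe_devices.cpp
#include <cstdio>
#include <omp.h>

int main() {
    printf("offload devices visible: %d\n", omp_get_num_devices());
    printf("default device: %d\n", omp_get_default_device());
    return 0;
}|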


If you want to reproduce these issues, here are the commands:

   |git clone https://github.com/fangq/umcx.git
   cd umcx/src
   make nvidia CXX=g++-12   # nvptx offloading, need nvidia gpu
   ./umcx cube60    # or ncu ./umcx cube60 to call nsight-compute
   |

   |make nvc    # use nvc++ to build the binary, need nvidia gpu
   ./umcx cube60  # or ncu ./umcx cube60|

   |make amd CXX=g++-12    # amdgcn offloading, need gfx906 gpu
   ./umcx cube60
   |


Any suggestions on how to get around these issues would be greatly appreciated!

thanks


Qianqian




