Dear list,
Happy new year!
I spent the past few days trying to port a previously developed CUDA
photon simulator (https://github.com/fangq/mcx) to OpenMP 5.x.
After receiving some help from MatColgrove on the NVIDIA forum
(https://forums.developer.nvidia.com/t/how-to-use-openmp-map-directive-to-map-dynamic-array-inside-a-struct-class-to-the-gpu/318270),
I was able to successfully compile the code with both g++-12 (using
nvptx-none offloading) and NVIDIA nvc++. I would like to get some help
from this list because I noticed a dramatic (50x) speed difference
between the g++-offloaded binary and the nvc++-compiled version.
The working code (~400 lines) can be accessed at
https://github.com/fangq/umcx/blob/main/src/umcx.cpp; the core of the
code is the omp target teams distribute parallel for loop shown below:
|#ifdef _LIBGOMP_OMP_LOCK_DEFINED
    const int gridsize = 200000 / 64, blocksize = 2;   // gcc nvptx offloading uses {32, teams_thread_limit, 1} as blockdim
#else
    const int gridsize = 200000 / 64, blocksize = 64;  // nvc uses {num_teams, 1, 1} as griddim and {teams_thread_limit, 1, 1} as blockdim
#endif
    #pragma omp target teams distribute parallel for \
        num_teams(gridsize) thread_limit(blocksize) \
        map(to: inputvol) map(to: inputvol.vol[0:inputvol.dimxyzt]) \
        map(tofrom: outputvol) map(tofrom: outputvol.vol[0:outputvol.dimxyzt]) \
        map(to: pos) map(to: dir) map(to: seeds) map(to: gcfg) map(to: prop[0:gcfg.mediumnum]) \
        reduction(+ : energyescape) firstprivate(ran, p)
    for (uint64_t i = 0; i < nphoton; i++) {
        ran.reseed(seeds.x ^ i, seeds.y | i, seeds.z ^ i, seeds.w | i);
        p.launch(pos, dir);
        p.run(inputvol, outputvol, prop, ran, gcfg);
        energyescape += p.pos.w;
    }
|
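For anyone not familiar with mapping a dynamically allocated array inside
a struct/class (the topic of the NVIDIA forum thread linked above), here
is a minimal standalone sketch of the same mapping pattern; the struct
and function names are only illustrative, not the ones in umcx.cpp:
|#include <cstddef>

struct Volume {          // illustrative stand-in for the volume structs in umcx.cpp
    float* vol;          // dynamically allocated voxel buffer
    size_t dimxyzt;      // number of elements in vol
};

void scale(Volume& outputvol) {
    // mapping the struct and the pointed-to array on the same construct
    // attaches the device copy of vol to the device-side struct
    #pragma omp target teams distribute parallel for \
        map(tofrom: outputvol) map(tofrom: outputvol.vol[0:outputvol.dimxyzt])
    for (size_t i = 0; i < outputvol.dimxyzt; i++) {
        outputvol.vol[i] *= 2.0f;
    }
}
|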
So far, I've only tested this on Linux (Ubuntu 22.04, with
g++-11/12/13). Although the |g++ -foffload=nvptx-none| produced binary
runs successfully and correctly on two of my Linux boxes, on both of
them it is about *50x slower* than the nvc++ compiled binary on the same
GPU; the nvc++ binary in turn is about 2x slower than my original
CUDA-based simulator.
Using nsight-compute (ncu), I noticed two major differences between the
g++/nvptx offloaded simulation and the nvc++ offloaded version.
For the nvc++ compiled binary, the launch configuration is shown in
the table below:
|* nvkernel_main_F1L451_2 (3125, 1, 1)x(64, 1, 1), Context 1, Stream
13, Device 0, CC 7.5*
Section: Launch Statistics
-------------------------------- --------------- ---------------
Metric Name Metric Unit Metric Value
-------------------------------- --------------- ---------------
Block Size 64
Function Cache Configuration CachePreferNone
* Grid Size 3,125*
* Registers Per Thread register/thread 66*
Shared Memory Configuration Size Kbyte 32.77
Driver Shared Memory Per Block byte/block 0
Dynamic Shared Memory Per Block byte/block 0
Static Shared Memory Per Block byte/block 8
# SMs SM 48
Threads thread 200,000
Uses Green Context 0
Waves Per SM 4.65
-------------------------------- --------------- ---------------|
| Section: Occupancy
------------------------------- ----------- ------------
Metric Name Metric Unit Metric Value
------------------------------- ----------- ------------
Block Limit SM block 16
Block Limit Registers block 14
Block Limit Shared Mem block 128
Block Limit Warps block 16
Theoretical Active Warps per SM warp 28
Theoretical Occupancy % 87.50
* Achieved Occupancy % 78.43*
Achieved Active Warps Per SM warp 25.10
------------------------------- ----------- ------------
|
In comparison, the g++-12 compiled binary produces the stats below:
|* main$_omp_fn$0 (288, 1, 1)x(32, 2, 1), Context 1, Stream 7, Device
0, CC 7.5*
Section: Launch Statistics
-------------------------------- --------------- ---------------
Metric Name Metric Unit Metric Value
-------------------------------- --------------- ---------------
Block Size 64
Function Cache Configuration CachePreferNone
* Grid Size 288*
* Registers Per Thread register/thread 151*
Shared Memory Configuration Size Kbyte 32.77
Driver Shared Memory Per Block byte/block 0
Dynamic Shared Memory Per Block byte/block 0
Static Shared Memory Per Block byte/block 396
# SMs SM 48
Threads thread 18,432
Uses Green Context 0
Waves Per SM 1
-------------------------------- --------------- ---------------
Section: Occupancy
------------------------------- ----------- ------------
Metric Name Metric Unit Metric Value
------------------------------- ----------- ------------
Block Limit SM block 16
Block Limit Registers block 6
Block Limit Shared Mem block 64
Block Limit Warps block 16
Theoretical Active Warps per SM warp 12
Theoretical Occupancy % 37.50
* Achieved Occupancy % 31.01*
Achieved Active Warps Per SM warp 9.92
------------------------------- ----------- ------------|
One of the major differences is that the g++ compiled binary ignored my
|num_teams(gridsize)| setting and launched about 11x fewer teams/blocks
than the requested grid size (200000/64 = 3125). For my particular
workload, I found in the past that at least tens of thousands of threads
are needed to maximize occupancy.
*So my first question* is: how do I force g++ to use my desired number
of teams? I tried both |num_teams| and setting |OMP_NUM_TEAMS|, but g++
somehow decided to use 288 teams.
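In case it helps to reproduce this without ncu, the launch shape that
the runtime actually grants can also be queried from inside a target
region with a small probe like the one below (a standalone sketch, not
part of umcx.cpp):
|#include <cstdio>
#include <omp.h>

int main() {
    int teams = 0, threads = 0;
    // request the same shape as in umcx.cpp, report what the runtime grants
    #pragma omp target teams num_teams(200000 / 64) thread_limit(64) map(tofrom: teams, threads)
    {
        if (omp_get_team_num() == 0) {
            teams = omp_get_num_teams();
            #pragma omp parallel
            if (omp_get_thread_num() == 0) {
                threads = omp_get_num_threads();
            }
        }
    }
    printf("launched %d teams x %d threads per team\n", teams, threads);
    return 0;
}
|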
The second major difference is that the g++ compiled binary uses 151
registers per thread, compared to 66 for nvc++. The higher register
count limits the maximum number of blocks that can run concurrently on
an SM.
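(If my arithmetic is right, this matches the "Block Limit Registers"
rows above, assuming the usual 64 K 32-bit registers per SM and per-warp
allocation in 256-register granules on a CC 7.5 device:
|nvc++:  66 regs/thread x 32 = 2112 -> 2304 regs/warp; x 2 warps/block = 4608; floor(65536/4608) = 14 blocks/SM
g++  : 151 regs/thread x 32 = 4832 -> 4864 regs/warp; x 2 warps/block = 9728; floor(65536/9728) =  6 blocks/SM|
)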
*My second question* is: is there a flag to reduce such register use?
If you manually count the variables used in the two classes performing
the simulation, |MCX_rand ran| and |MCX_photon p|, they should not cost
that many registers.
On a side note, g++-13 can compile the code, but the resulting binary
fails at run time with the error below, so I can't tell whether its
behavior differs:
|libgomp: cuMemGetAddressRange_v2 error: named symbol not found
libgomp: Copying of host object [0x7ffee93eeb18..0x7ffee93eeb20) to
dev object [0x7fb1e84b5010..0x7fb1e84b5018) failed|
*My 3rd question* is regarding using printf() inside the omp target
section. The documentation
(https://gcc.gnu.org/onlinedocs/libgomp/nvptx.html) shows that printf()
should work, but when I build the binary, I get an "unresolved symbol
__printf_chk" error. I am wondering if there is any flag to enable printf()?
Finally, for amdgcn, I was able to compile the binary with both g++-12
and g++-13; however, running the binary reports errors in both cases.
For the g++-12 produced amdgcn binary, the error is
|libgomp: Offload data incompatible with GCN plugin (expected 3,
received 2)
libgomp: Cannot map target functions or variables (expected 1, have
4294967295)|
For the g++-13 produced amdgcn binary, the error is
|Memory access fault by GPU node-1 (Agent handle: 0x629916042c00) on
address 0x7ae82cef1000. Reason: Page not present or supervisor
privilege.
Aborted (core dumped)|
The testing machine has 3 AMD GPUs: 7900 GRE, 5600XT, and Radeon VII
(|gfx906|), running Ubuntu 22.04 and ROCm 6.
If you want to reproduce these issues, here are the commands:
|git clone https://github.com/fangq/umcx.git
cd umcx/src
make nvidia CXX=g++-12 # nvptx offloading, need nvidia gpu
./umcx cube60 # or ncu ./umcx cube60 to call nsight-compute
|
|make nvc # use nvc++ to build the binary, need nvidia gpu
./umcx cube60 # or ncu ./umcx cube60|
|make amd CXX=g++-12 # amdgcn offloading, need gfx906 gpu
./umcx cube60
|
Any suggestions on how to get around these issues would be greatly
appreciated!
thanks
Qianqian