Re: [PATCH 2/2] drm/amdkfd: Allow memory oversubscription on small APUs

Felix Kuehling <felix.kuehling@xxxxxxx> · Mon, 29 Apr 2024 14:39:30 -0400

On 2024-04-29 06:38, Yu, Lang wrote:

-----Original Message-----
From: Kuehling, Felix <Felix.Kuehling@xxxxxxx>
Sent: Saturday, April 27, 2024 6:45 AM
To: Yu, Lang <Lang.Yu@xxxxxxx>; amd-gfx@xxxxxxxxxxxxxxxxxxxxx
Cc: Yang, Philip <Philip.Yang@xxxxxxx>; Koenig, Christian
<Christian.Koenig@xxxxxxx>; Zhang, Yifan <Yifan1.Zhang@xxxxxxx>; Liu,
Aaron <Aaron.Liu@xxxxxxx>
Subject: Re: [PATCH 2/2] drm/amdkfd: Allow memory oversubscription on
small APUs

On 2024-04-26 04:37, Lang Yu wrote:
The default ttm_tt_pages_limit is 1/2 of system memory.
Such a configuration is prone to running out of memory.
Indiscriminately allowing the violation of all memory limits is not a good
solution. It will lead to poor performance once you actually reach
ttm_pages_limit and TTM starts swapping out BOs.
Hi Felix,

It just feels like a bug to me that the driver tells users it is out of memory while 1/2 of system memory is still free.
On the other hand, if memory is available, why not use it?

TTM does not allow us to use more than 1/2 system memory. I believe 
that's because TTM needs additional memory to swap out BOs. Any GTT 
allocation through the render node APIs is subject to the same limitations.

Render node APIs can handle memory overcommitment more gracefully 
because the kernel mode driver is in the loop for command submissions 
and fences. That doesn't work for KFD with user mode queues. The memory 
limits in KFD are there to prevent overcommitting memory because we need 
all of our memory (per process) to be resident at the same time. If we 
let KFD exceed the TTM limits, we get into situations where we're 
thrashing (processes evicting each other constantly) or even worse, 
where we're just not able to make all memory resident. So we end up with 
suspended user mode queues and extremely poor performance or soft hangs.
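
As a rough illustration of the check being debated, here is a minimal user-space sketch. It is not the kernel code: the struct and function names are hypothetical, only the TTM limit is modeled (the real check also covers the system and VRAM limits), and there is no locking. It only shows how a strict limit refuses a request that an "allow oversubscription" flag would let through.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of the accounting state; not the kernel struct. */
struct mem_limit_model {
	uint64_t max_ttm_mem_limit;  /* bytes TTM is allowed to hold */
	uint64_t ttm_mem_used;       /* bytes currently accounted */
	bool allow_oversubscribe;    /* models the flag the patch adds */
};

/*
 * Try to account for `size` more bytes.
 * Returns 0 and charges the request on success, or -ENOMEM (leaving
 * the accounting untouched) when the limit would be exceeded and
 * oversubscription is not allowed.
 */
static int model_reserve(struct mem_limit_model *m, uint64_t size)
{
	assert(m != NULL);

	if (m->ttm_mem_used + size > m->max_ttm_mem_limit &&
	    !m->allow_oversubscribe)
		return -ENOMEM;

	m->ttm_mem_used += size;
	return 0;
}
```

With a 15719M limit and 13316M already in use, a further 6658M request is refused by the strict model and accepted by the oversubscribing one, which then sits at 19974M, well past the limit. The concern above is about what happens after that point, which this sketch does not model.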



By the way, can we use USERPTR for VRAM allocations?
Then we don't have ttm_tt_pages_limit limitations. Thanks.

No. There is an expectation that VRAM BOs can be shared between 
processes through DMABufs (for HIP IPC APIs). You can't export userptrs 
as DMABufs.

You can try to raise the TTM pages limit using a TTM module parameter. 
But this is taking a risk for system stability when TTM gets into a 
situation where it needs to swap out a large BO.

Regards,
  Felix



I actually did some tests on Strix (12 CU@2100 MHz, 29412M 128bits LPDDR5@937MHz) with
https://github.com/ROCm/pytorch-micro-benchmarking.

Command: python micro_benchmarking_pytorch.py --network resnet50 --batch-size=64 --iterations=20

1, Run 1 resnet50 (FP32, batch size 64)
Memory usage:
         System mem used 6748M out of 29412M
         TTM mem used 6658M out of 15719M
Memory oversubscription percentage:  0
Throughput [img/sec] : 49.04

2,  Run 2 resnet50 simultaneously (FP32, batch size 64)
Memory usage:
         System mem used 13496M out of 29412M
         TTM mem used 13316M out of 15719M
Memory oversubscription percentage:  0
Throughput [img/sec] (respectively) : 25.27 / 26.70

3, Run 3 resnet50 simultaneously (FP32, batch size 64)
Memory usage:
         System mem used 20245M out of 29412M
         TTM mem used 19974M out of 15719M
Memory oversubscription percentage:  ~27%

Throughput [img/sec](respectively) : 10.62 / 7.47 / 6.90 (In theory: 16 / 16 / 16)
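
For reference, the ~27% figure and the "in theory" throughput can be reproduced from the numbers above. The sketch below is only a worked calculation; the helper names are made up for illustration, and the "ideal" value assumes the single-process rate is split evenly with no eviction overhead.

```c
#include <assert.h>
#include <stdint.h>

/* Percentage by which usage exceeds the limit; 0 when within the limit. */
static double oversubscription_pct(uint64_t used_mib, uint64_t limit_mib)
{
	assert(limit_mib > 0);

	if (used_mib <= limit_mib)
		return 0.0;
	return 100.0 * (double)(used_mib - limit_mib) / (double)limit_mib;
}

/* Per-process rate if n processes evenly split the single-process rate. */
static double ideal_share(double single_rate, unsigned int n)
{
	assert(n > 0);
	return single_rate / n;
}
```

Calling oversubscription_pct(19974, 15719) gives roughly 27.07, and ideal_share(49.04, 3) gives roughly 16.35 img/sec per process, against the measured 10.62 / 7.47 / 6.90.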

 From my observations,

1, The GPU is heavily underutilized when running 3 resnet50 instances simultaneously with ~27% memory oversubscription; sometimes its load is below 50%, or even 0.
The driver is busy evicting and restoring processes. It takes ~2-5 seconds to restore all the BOs for one process (swapping BOs in and out, which actually allocates and copies pages),
even though the process doesn't need all of its allocated BOs to be resident.

2, Sometimes fairness between processes can't be guaranteed when memory is oversubscribed.
They don't share the GPU equally even when created with the default priority.

3, The less time the GPU sits idle during eviction and restore, the smaller the performance degradation under memory oversubscription.

Regards,
Lang

Regards,
   Felix


Signed-off-by: Lang Yu <Lang.Yu@xxxxxxx>
---
   drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c       |  2 +-
   drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h       |  4 ++--
   drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c | 12 +++++++++---
   3 files changed, 12 insertions(+), 6 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
index 3295838e9a1d..c01c6f3ab562 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
@@ -167,7 +167,7 @@ void amdgpu_amdkfd_device_init(struct amdgpu_device *adev)
      int i;
      int last_valid_bit;

-    amdgpu_amdkfd_gpuvm_init_mem_limits();
+    amdgpu_amdkfd_gpuvm_init_mem_limits(adev);

      if (adev->kfd.dev) {
              struct kgd2kfd_shared_resources gpu_resources = {
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h
index 1de021ebdd46..13284dbd8c58 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h
@@ -363,7 +363,7 @@ u64 amdgpu_amdkfd_xcp_memory_size(struct amdgpu_device *adev, int xcp_id);


   #if IS_ENABLED(CONFIG_HSA_AMD)
-void amdgpu_amdkfd_gpuvm_init_mem_limits(void);
+void amdgpu_amdkfd_gpuvm_init_mem_limits(struct amdgpu_device *adev);
   void amdgpu_amdkfd_gpuvm_destroy_cb(struct amdgpu_device *adev,
                              struct amdgpu_vm *vm);

@@ -376,7 +376,7 @@ void amdgpu_amdkfd_release_notify(struct amdgpu_bo *bo);
   void amdgpu_amdkfd_reserve_system_mem(uint64_t size);
   #else
   static inline
-void amdgpu_amdkfd_gpuvm_init_mem_limits(void)
+void amdgpu_amdkfd_gpuvm_init_mem_limits(struct amdgpu_device *adev)
   {
   }

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
index 7eb5afcc4895..a3e623a320b3 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
@@ -60,6 +60,7 @@ static struct {
      int64_t system_mem_used;
      int64_t ttm_mem_used;
      spinlock_t mem_limit_lock;
+    bool alow_oversubscribe;
   } kfd_mem_limit;

   static const char * const domain_bit_to_string[] = {
@@ -110,7 +111,7 @@ static bool reuse_dmamap(struct amdgpu_device *adev, struct amdgpu_device *bo_ad
    *  System (TTM + userptr) memory - 15/16th System RAM
    *  TTM memory - 3/8th System RAM
    */
-void amdgpu_amdkfd_gpuvm_init_mem_limits(void)
+void amdgpu_amdkfd_gpuvm_init_mem_limits(struct amdgpu_device *adev)
   {
      struct sysinfo si;
      uint64_t mem;
@@ -130,6 +131,7 @@ void amdgpu_amdkfd_gpuvm_init_mem_limits(void)
              kfd_mem_limit.max_system_mem_limit -= AMDGPU_RESERVE_MEM_LIMIT;
      kfd_mem_limit.max_ttm_mem_limit = ttm_tt_pages_limit() << PAGE_SHIFT;
+    kfd_mem_limit.alow_oversubscribe = !!(adev->flags & AMD_IS_APU);
      pr_debug("Kernel memory limit %lluM, TTM limit %lluM\n",
              (kfd_mem_limit.max_system_mem_limit >> 20),
              (kfd_mem_limit.max_ttm_mem_limit >> 20));
@@ -221,8 +223,12 @@ int amdgpu_amdkfd_reserve_mem_limit(struct amdgpu_device *adev,
           kfd_mem_limit.max_ttm_mem_limit) ||
          (adev && xcp_id >= 0 && adev->kfd.vram_used[xcp_id] + vram_needed >
           vram_size - reserved_for_pt - atomic64_read(&adev->vram_pin_size))) {
-            ret = -ENOMEM;
-            goto release;
+            if (kfd_mem_limit.alow_oversubscribe) {
+                    pr_warn_ratelimited("Memory is getting oversubscribed.\n");
+            } else {
+                    ret = -ENOMEM;
+                    goto release;
+            }
      }

      /* Update memory accounting by decreasing available system