RE: [PATCH 2/2] drm/amdkfd: Allow memory oversubscription on small APUs

[Public]

>-----Original Message-----
>From: Kuehling, Felix <Felix.Kuehling@xxxxxxx>
>Sent: Saturday, April 27, 2024 6:45 AM
>To: Yu, Lang <Lang.Yu@xxxxxxx>; amd-gfx@xxxxxxxxxxxxxxxxxxxxx
>Cc: Yang, Philip <Philip.Yang@xxxxxxx>; Koenig, Christian
><Christian.Koenig@xxxxxxx>; Zhang, Yifan <Yifan1.Zhang@xxxxxxx>; Liu,
>Aaron <Aaron.Liu@xxxxxxx>
>Subject: Re: [PATCH 2/2] drm/amdkfd: Allow memory oversubscription on
>small APUs
>
>On 2024-04-26 04:37, Lang Yu wrote:
>> The default ttm_tt_pages_limit is 1/2 of system memory.
>> It is prone to out of memory with such a configuration.
>Indiscriminately allowing the violation of all memory limits is not a good
>solution. It will lead to poor performance once you actually reach
>ttm_pages_limit and TTM starts swapping out BOs.

Hi Felix,

It feels like a bug to me that the driver reports out of memory to users while 1/2 of system memory is still free.
On the other hand, if memory is available, why not use it?

By the way, can we use USERPTR for VRAM allocations?
Then we would not be subject to the ttm_tt_pages_limit limitation. Thanks.

I ran some tests on Strix (12 CUs @ 2100 MHz, 29412M of 128-bit LPDDR5 @ 937 MHz) with
https://github.com/ROCm/pytorch-micro-benchmarking.

Command: python micro_benchmarking_pytorch.py --network resnet50 --batch-size=64 --iterations=20

1, Run 1 resnet50 instance (FP32, batch size 64)
Memory usage:
        System mem used 6748M out of 29412M
        TTM mem used 6658M out of 15719M
Memory oversubscription percentage:  0
Throughput [img/sec] : 49.04

2, Run 2 resnet50 instances simultaneously (FP32, batch size 64)
Memory usage:
        System mem used 13496M out of 29412M
        TTM mem used 13316M out of 15719M
Memory oversubscription percentage:  0
Throughput [img/sec] (respectively) : 25.27 / 26.70

3, Run 3 resnet50 instances simultaneously (FP32, batch size 64)
Memory usage:
        System mem used 20245M out of 29412M
        TTM mem used 19974M out of 15719M
Memory oversubscription percentage:  ~27%

Throughput [img/sec] (respectively) : 10.62 / 7.47 / 6.90 (in theory: 16 / 16 / 16)
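
For reference, the derived figures above (the ~27% oversubscription and the "in theory" throughput of roughly 16 img/sec per process) can be reproduced from the raw numbers with a small userspace helper. This is only a sketch of the arithmetic; the function names are made up for illustration and do not exist in the driver, and the ideal share assumes the single-process throughput would be split evenly.

```c
#include <stdint.h>

/*
 * Hypothetical helper (not driver code): percentage by which `used_mib`
 * exceeds `limit_mib`. Returns 0 when usage is within the limit.
 */
static double oversubscription_pct(uint64_t used_mib, uint64_t limit_mib)
{
	if (limit_mib == 0 || used_mib <= limit_mib)
		return 0.0;
	return 100.0 * (double)(used_mib - limit_mib) / (double)limit_mib;
}

/*
 * Hypothetical helper: per-process throughput if `nproc` processes shared
 * the GPU perfectly evenly with no eviction overhead.
 */
static double ideal_share(double single_throughput, unsigned int nproc)
{
	return nproc ? single_throughput / (double)nproc : 0.0;
}
```

With the numbers from test 3, oversubscription_pct(19974, 15719) gives about 27.1, and ideal_share(49.04, 3) gives about 16.3.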

From my observations,

1, The GPU is heavily underutilized when running 3 resnet50 instances simultaneously with ~27% memory oversubscription; its load is sometimes below 50% and occasionally drops to 0.
The driver is busy evicting and restoring processes. It takes ~2-5 seconds to restore all the BOs for one process (swapping BOs in and out, which actually allocates and copies pages),
even though the process doesn't need all of its allocated BOs to be resident.

2, Sometimes fairness between processes can't be guaranteed when memory is oversubscribed.
They don't share the GPU equally even though they were all created with the default priority.

3, The less time the GPU sits underutilized during eviction and restore, the smaller the performance degradation under memory oversubscription.

Regards,
Lang

>Regards,
>   Felix
>
>
>>
>> Signed-off-by: Lang Yu <Lang.Yu@xxxxxxx>
>> ---
>>   drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c       |  2 +-
>>   drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h       |  4 ++--
>>   drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c | 12
>+++++++++---
>>   3 files changed, 12 insertions(+), 6 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
>> index 3295838e9a1d..c01c6f3ab562 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
>> @@ -167,7 +167,7 @@ void amdgpu_amdkfd_device_init(struct
>amdgpu_device *adev)
>>      int i;
>>      int last_valid_bit;
>>
>> -    amdgpu_amdkfd_gpuvm_init_mem_limits();
>> +    amdgpu_amdkfd_gpuvm_init_mem_limits(adev);
>>
>>      if (adev->kfd.dev) {
>>              struct kgd2kfd_shared_resources gpu_resources = {
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h
>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h
>> index 1de021ebdd46..13284dbd8c58 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h
>> @@ -363,7 +363,7 @@ u64 amdgpu_amdkfd_xcp_memory_size(struct
>> amdgpu_device *adev, int xcp_id);
>>
>>
>>   #if IS_ENABLED(CONFIG_HSA_AMD)
>> -void amdgpu_amdkfd_gpuvm_init_mem_limits(void);
>> +void amdgpu_amdkfd_gpuvm_init_mem_limits(struct amdgpu_device
>*adev);
>>   void amdgpu_amdkfd_gpuvm_destroy_cb(struct amdgpu_device *adev,
>>                              struct amdgpu_vm *vm);
>>
>> @@ -376,7 +376,7 @@ void amdgpu_amdkfd_release_notify(struct
>amdgpu_bo *bo);
>>   void amdgpu_amdkfd_reserve_system_mem(uint64_t size);
>>   #else
>>   static inline
>> -void amdgpu_amdkfd_gpuvm_init_mem_limits(void)
>> +void amdgpu_amdkfd_gpuvm_init_mem_limits(struct amdgpu_device
>*adev)
>>   {
>>   }
>>
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
>> index 7eb5afcc4895..a3e623a320b3 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
>> @@ -60,6 +60,7 @@ static struct {
>>      int64_t system_mem_used;
>>      int64_t ttm_mem_used;
>>      spinlock_t mem_limit_lock;
>> +    bool alow_oversubscribe;
>>   } kfd_mem_limit;
>>
>>   static const char * const domain_bit_to_string[] = {
>> @@ -110,7 +111,7 @@ static bool reuse_dmamap(struct amdgpu_device *adev, struct amdgpu_device *bo_ad
>>    *  System (TTM + userptr) memory - 15/16th System RAM
>>    *  TTM memory - 3/8th System RAM
>>    */
>> -void amdgpu_amdkfd_gpuvm_init_mem_limits(void)
>> +void amdgpu_amdkfd_gpuvm_init_mem_limits(struct amdgpu_device
>*adev)
>>   {
>>      struct sysinfo si;
>>      uint64_t mem;
>> @@ -130,6 +131,7 @@ void
>amdgpu_amdkfd_gpuvm_init_mem_limits(void)
>>              kfd_mem_limit.max_system_mem_limit -=
>AMDGPU_RESERVE_MEM_LIMIT;
>>
>>      kfd_mem_limit.max_ttm_mem_limit = ttm_tt_pages_limit() <<
>> PAGE_SHIFT;
>> +    kfd_mem_limit.alow_oversubscribe = !!(adev->flags & AMD_IS_APU);
>>      pr_debug("Kernel memory limit %lluM, TTM limit %lluM\n",
>>              (kfd_mem_limit.max_system_mem_limit >> 20),
>>              (kfd_mem_limit.max_ttm_mem_limit >> 20));
>> @@ -221,8 +223,12 @@ int amdgpu_amdkfd_reserve_mem_limit(struct amdgpu_device *adev,
>>           kfd_mem_limit.max_ttm_mem_limit) ||
>>          (adev && xcp_id >= 0 && adev->kfd.vram_used[xcp_id] + vram_needed >
>>           vram_size - reserved_for_pt - atomic64_read(&adev->vram_pin_size))) {
>> -            ret = -ENOMEM;
>> -            goto release;
>> +            if (kfd_mem_limit.alow_oversubscribe) {
>> +                    pr_warn_ratelimited("Memory is getting
>oversubscried.\n");
>> +            } else {
>> +                    ret = -ENOMEM;
>> +                    goto release;
>> +            }
>>      }
>>
>>      /* Update memory accounting by decreasing available system
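
To make the behavior change in the last hunk easier to reason about, here is a simplified userspace model of the limit check. It is only a sketch under stated assumptions: it covers the TTM limit alone (the real check also involves system memory and VRAM), `struct mem_limit_model` and `reserve_ttm_model()` are hypothetical names that do not exist in the driver, and there is no locking.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the TTM-related fields of kfd_mem_limit. */
struct mem_limit_model {
	uint64_t max_ttm_mem_limit;	/* limit, in any consistent unit */
	int64_t ttm_mem_used;		/* currently accounted usage */
	bool allow_oversubscribe;	/* set for APUs in the patch */
};

/*
 * Model of the modified check: a request that would exceed the limit
 * fails with -ENOMEM unless oversubscription is allowed, in which case
 * it is accounted anyway. Usage is updated only on success.
 */
static int reserve_ttm_model(struct mem_limit_model *m, uint64_t needed)
{
	bool over = (uint64_t)m->ttm_mem_used + needed > m->max_ttm_mem_limit;

	if (over && !m->allow_oversubscribe)
		return -ENOMEM;

	m->ttm_mem_used += (int64_t)needed;
	return 0;
}
```

Using the numbers from the tests above (limit 15719M, 13316M already used), a third 6658M request is rejected with the flag clear and accepted with it set, which leaves 19974M accounted against a 15719M limit. That is the state in which TTM starts swapping out BOs.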



