RE: [PATCH] drm/amdkfd: keep create queue success if cwsr save area doesn't match

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



[AMD Official Use Only - AMD Internal Distribution Only]

Got it. Will address this issue in ROCr.

Best Regards,
Yifan

-----Original Message-----
From: Kuehling, Felix <Felix.Kuehling@xxxxxxx>
Sent: Wednesday, August 14, 2024 10:47 PM
To: Zhang, Yifan <Yifan1.Zhang@xxxxxxx>; Christopher Snowhill <chris@xxxxxxxxxx>
Cc: Kasiviswanathan, Harish <Harish.Kasiviswanathan@xxxxxxx>; amd-gfx@xxxxxxxxxxxxxxxxxxxxx; Yang, Philip <Philip.Yang@xxxxxxx>
Subject: Re: [PATCH] drm/amdkfd: keep create queue success if cwsr save area doesn't match



On 2024-08-14 2:35, Zhang, Yifan wrote:
> [AMD Official Use Only - AMD Internal Distribution Only]
>
> AFAIK, for low level libraries, e.g. LLVM, ROCr, Hip/OpenCL runtimes, all GPUs are supported. But for the mathlibs and frameworks, only limited GPUs are supported. E.g. :
>
> https://github.com/ROCm/rocBLAS/blob/28877e5e134a157b7ea56b88a1a12ba55
> 1d53cbf/CMakeLists.txt#L111
>
> set( TARGET_LIST_ROCM_6.3
> "gfx900;gfx906:xnack-;gfx908:xnack-;gfx90a;gfx940;gfx941;gfx942;gfx101
> 0;gfx1012;gfx1030;gfx1100;gfx1101;gfx1102;gfx1200;gfx1201")
>
> On the unsupported GPUs, HSA_OVERRIDE_GFX_VERSION currently works as a workaround.

Then HSA_OVERRIDE_GFX_VERSION probably does more than we need it to for working around support in the math libs. What math libs care about is mainly the ISA target version. There should be no need to allocate different sizes for CWSR areas.

Regards,
  Felix


>
>
> Best Regards,
> Yifan
>
> -----Original Message-----
> From: Christopher Snowhill <chris@xxxxxxxxxx>
> Sent: Wednesday, August 14, 2024 10:01 AM
> To: Kuehling, Felix <Felix.Kuehling@xxxxxxx>
> Cc: Zhang, Yifan <Yifan1.Zhang@xxxxxxx>; Kasiviswanathan, Harish
> <Harish.Kasiviswanathan@xxxxxxx>; amd-gfx@xxxxxxxxxxxxxxxxxxxxx; Yang,
> Philip <Philip.Yang@xxxxxxx>
> Subject: Re: [PATCH] drm/amdkfd: keep create queue success if cwsr
> save area doesn't match
>
>
>
>> On Aug 13, 2024, at 6:52 PM, Felix Kuehling <felix.kuehling@xxxxxxx> wrote:
>>
>> Hang on a second. If there are production GPUs that only work with HSA_OVERRIDE_GFX_VERSION right now, then we should make those GPUs properly supported. I thought this was only used internally for bring-up or maybe externally as a short-term solution before we upstream proper support for new GPUs.
>
> For instance, for a bunch of compute things, I have to override 10.3.0 for my 10.3.1 GPU, a 6700 XT, because nobody builds or packages the kernels for 10.3.1.
>
>>
>> Regards,
>>  Felix
>>
>>
>>> On 2024-08-11 22:10, Zhang, Yifan wrote:
>>> [AMD Official Use Only - AMD Internal Distribution Only]
>>>
>>> I agree that adding exp_hw_support is a safer approach. My concern is that HSA_OVERRIDE_GFX_VERSION has been used for a while and has become a status quo for running ROCm on unsupported APUs. I'm not sure if this approach will be a burden for APU end users. Adding driver load parameters is more complicated than simply adding an environment variable on consumer PCs.
>>>
>>> Best Regards,
>>> Yifan
>>>
>>> -----Original Message-----
>>> From: Kuehling, Felix <Felix.Kuehling@xxxxxxx>
>>> Sent: Saturday, August 10, 2024 7:37 AM
>>> To: Zhang, Yifan <Yifan1.Zhang@xxxxxxx>; Kasiviswanathan, Harish
>>> <Harish.Kasiviswanathan@xxxxxxx>; amd-gfx@xxxxxxxxxxxxxxxxxxxxx
>>> Cc: Yang, Philip <Philip.Yang@xxxxxxx>
>>> Subject: Re: [PATCH] drm/amdkfd: keep create queue success if cwsr
>>> save area doesn't match
>>>
>>> Maybe we can turn this check into a warnings if, and only if the exp_hw_support module param is set. That way we don't water down the checks on the production code path but allow experimental setups to run without a seat belt.
>>>
>>> Regards,
>>>   Felix
>>>
>>>
>>>> On 2024-08-09 01:39, Zhang, Yifan wrote:
>>>> [AMD Official Use Only - AMD Internal Distribution Only]
>>>>
>>>> Yes, I think we need that change for a normal code path, but this case is introduced only with the HSA_OVERRIDE_GFX_VERSION environment setting, which implies that "the override ASIC is compatible with the real ASIC." It is intended for experimental purposes. When a user is using HSA_OVERRIDE_GFX_VERSION, they should be aware of the potential risks it may bring. Usually, HSA_OVERRIDE_GFX_VERSION is used to force an unsupported APU to be recognized as a ROCm-supported high-end dGPU, which has a large cwsr save area, making the operation safe. This check was added to KFD two weeks ago, the HSA_OVERRIDE_GFX_VERSION environment had been working fine before that.
>>>>
>>>> Best Regards,
>>>> Yifan
>>>>
>>>> -----Original Message-----
>>>> From: Kasiviswanathan, Harish <Harish.Kasiviswanathan@xxxxxxx>
>>>> Sent: Thursday, August 8, 2024 10:46 PM
>>>> To: Zhang, Yifan <Yifan1.Zhang@xxxxxxx>;
>>>> amd-gfx@xxxxxxxxxxxxxxxxxxxxx
>>>> Cc: Kuehling, Felix <Felix.Kuehling@xxxxxxx>; Yang, Philip
>>>> <Philip.Yang@xxxxxxx>; Zhang, Yifan <Yifan1.Zhang@xxxxxxx>
>>>> Subject: RE: [PATCH] drm/amdkfd: keep create queue success if cwsr
>>>> save area doesn't match
>>>>
>>>> [AMD Official Use Only - AMD Internal Distribution Only]
>>>>
>>>> In this case, shouldn't larger of two sizes be used. Also, we should have an upper bound check.
>>>>
>>>> -----Original Message-----
>>>> From: amd-gfx <amd-gfx-bounces@xxxxxxxxxxxxxxxxxxxxx> On Behalf Of
>>>> Yifan Zhang
>>>> Sent: Thursday, August 8, 2024 4:44 AM
>>>> To: amd-gfx@xxxxxxxxxxxxxxxxxxxxx
>>>> Cc: Kuehling, Felix <Felix.Kuehling@xxxxxxx>; Yang, Philip
>>>> <Philip.Yang@xxxxxxx>; Zhang, Yifan <Yifan1.Zhang@xxxxxxx>
>>>> Subject: [PATCH] drm/amdkfd: keep create queue success if cwsr save
>>>> area doesn't match
>>>>
>>>> If HSA_OVERRIDE_GFX_VERSION is used in ROCm workload, user space and kernel use different spec to calculate cwsr save area, current check may fail create queue ioctl. Change error to warn to make create queue succeed in that case.
>>>>
>>>> Signed-off-by: Yifan Zhang <yifan1.zhang@xxxxxxx>
>>>> ---
>>>>  drivers/gpu/drm/amd/amdkfd/kfd_queue.c | 4 +---
>>>>  1 file changed, 1 insertion(+), 3 deletions(-)
>>>>
>>>> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_queue.c
>>>> b/drivers/gpu/drm/amd/amdkfd/kfd_queue.c
>>>> index e0a073ae4a49..9f283aff057a 100644
>>>> --- a/drivers/gpu/drm/amd/amdkfd/kfd_queue.c
>>>> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_queue.c
>>>> @@ -295,11 +295,9 @@ int kfd_queue_acquire_buffers(struct kfd_process_device *pdd, struct queue_prope
>>>>         }
>>>>
>>>>         if (properties->ctx_save_restore_area_size != topo_dev->node_props.cwsr_size) {
>>>> -               pr_debug("queue cwsr size 0x%x not equal to node cwsr size 0x%x\n",
>>>> +               pr_warn("queue cwsr size 0x%x not equal to node
>>>> + cwsr size 0x%x\n",
>>>>                         properties->ctx_save_restore_area_size,
>>>>                         topo_dev->node_props.cwsr_size);
>>>> -               err = -EINVAL;
>>>> -               goto out_err_unreserve;
>>>>         }
>>>>
>>>>         total_cwsr_size = (topo_dev->node_props.cwsr_size +
>>>> topo_dev->node_props.debug_memory_size)
>>>> --
>>>> 2.37.3
>>>>
>>>>




[Index of Archives]     [Linux USB Devel]     [Linux Audio Users]     [Yosemite News]     [Linux Kernel]     [Linux SCSI]

  Powered by Linux