Re: [PATCH] drm/amdgpu: fix gpu recovery disable with per queue reset

"Lazar, Lijo" <lijo.lazar@xxxxxxx> · Fri, 10 Jan 2025 21:59:18 +0530

On 1/10/2025 9:43 PM, Kim, Jonathan wrote:
> [Public]
> 
>> -----Original Message-----
>> From: Lazar, Lijo <Lijo.Lazar@xxxxxxx>
>> Sent: Thursday, January 9, 2025 10:39 PM
>> To: Kim, Jonathan <Jonathan.Kim@xxxxxxx>; amd-gfx@xxxxxxxxxxxxxxxxxxxxx
>> Cc: Kasiviswanathan, Harish <Harish.Kasiviswanathan@xxxxxxx>
>> Subject: Re: [PATCH] drm/amdgpu: fix gpu recovery disable with per queue reset
>>
>>
>>
>> On 1/9/2025 8:27 PM, Kim, Jonathan wrote:
>>> [Public]
>>>
>>>> -----Original Message-----
>>>> From: Lazar, Lijo <Lijo.Lazar@xxxxxxx>
>>>> Sent: Thursday, January 9, 2025 1:14 AM
>>>> To: Kim, Jonathan <Jonathan.Kim@xxxxxxx>; amd-gfx@xxxxxxxxxxxxxxxxxxxxx
>>>> Cc: Kasiviswanathan, Harish <Harish.Kasiviswanathan@xxxxxxx>
>>>> Subject: Re: [PATCH] drm/amdgpu: fix gpu recovery disable with per queue reset
>>>>
>>>>
>>>>
>>>> On 1/9/2025 1:31 AM, Jonathan Kim wrote:
>>>>> Per queue reset should be bypassed when gpu recovery is disabled
>>>>> with module parameter.
>>>>>
>>>>> Signed-off-by: Jonathan Kim <jonathan.kim@xxxxxxx>
>>>>> ---
>>>>>  drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gfx_v9.c | 6 ++++++
>>>>>  1 file changed, 6 insertions(+)
>>>>>
>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gfx_v9.c
>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gfx_v9.c
>>>>> index cc66ebb7bae1..441568163e20 100644
>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gfx_v9.c
>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gfx_v9.c
>>>>> @@ -1131,6 +1131,9 @@ uint64_t kgd_gfx_v9_hqd_get_pq_addr(struct
>>>> amdgpu_device *adev,
>>>>>     uint32_t low, high;
>>>>>     uint64_t queue_addr = 0;
>>>>>
>>>>> +   if (!amdgpu_gpu_recovery)
>>>>> +           return 0;
>>>>> +
>>>>>     kgd_gfx_v9_acquire_queue(adev, pipe_id, queue_id, inst);
>>>>>     amdgpu_gfx_rlc_enter_safe_mode(adev, inst);
>>>>>
>>>>> @@ -1179,6 +1182,9 @@ uint64_t kgd_gfx_v9_hqd_reset(struct
>> amdgpu_device
>>>> *adev,
>>>>>     uint32_t low, high, pipe_reset_data = 0;
>>>>>     uint64_t queue_addr = 0;
>>>>>
>>>>> +   if (!amdgpu_gpu_recovery)
>>>>> +           return 0;
>>>>> +
>>>>
>>>> I think the right place for this check is not inside callback, should be
>>>> from the place where the callback gets called.
>>>
>>> I don't think it really matters.  We're going to have different reset types in the future
>> that my come from different callers.
>>> It's probably easier to remember to put the bypass where the reset is actually
>> happening.
>>>
>>
>> If I want to define something like amdgpu_gpu_recovery=2 (don't do queue
>> reset but perform other resets), then it matters.
> 
> I don't get why that matters.
> This callback alone, for example, calls 2 types of resets within itself (queue then pipe).
> If you wanted partial resets between queue and pipe in this case, you'd have to branch test within the callback itself.
> GPU reset bypass checks are invisible to the KFD section of the code as well.
> 
>>
>> Since this is a callback, keeping it at the wrapper place may be more
>> maintainable rather than keeping the check for gfx10/11/12 etc.
> 
> Maybe not.  MES is preemption checks are not like MEC preemption checks at all.
> And we probably don't want to jam other future IP resets into a single caller.
> If you look at the kfd2kgd callbacks, most are pretty much copy and paste from one generation to another.
> I don't see how putting the test in the callback makes it less maintainable.
> 

My thought process was this could be put in - reset_queues_on_hws_hang
and anything similar handles MES based queue resets. What you are saying
there won't be anything like reset_queue_by_mes() for MES based resets.
Is that correct?

Thanks,
Lijo

> Jon
> 
>>
>> Thanks,
>> Lijo
>>
>>> Jon
>>>
>>>>
>>>> Thanks,
>>>> Lijo
>>>>
>>>>>     kgd_gfx_v9_acquire_queue(adev, pipe_id, queue_id, inst);
>>>>>     amdgpu_gfx_rlc_enter_safe_mode(adev, inst);
>>>>>
>>>
>