Re: [PATCH] drm/amdgpu: fix gpu recovery disable with per queue reset

"Lazar, Lijo" <lijo.lazar@xxxxxxx> · Sat, 11 Jan 2025 09:06:51 +0530

On 1/11/2025 2:53 AM, Kim, Jonathan wrote:
> [Public]
> 
>> -----Original Message-----
>> From: Lazar, Lijo <Lijo.Lazar@xxxxxxx>
>> Sent: Friday, January 10, 2025 11:29 AM
>> To: Kim, Jonathan <Jonathan.Kim@xxxxxxx>; amd-gfx@xxxxxxxxxxxxxxxxxxxxx
>> Cc: Kasiviswanathan, Harish <Harish.Kasiviswanathan@xxxxxxx>
>> Subject: Re: [PATCH] drm/amdgpu: fix gpu recovery disable with per queue reset
>>
>>
>>
>> On 1/10/2025 9:43 PM, Kim, Jonathan wrote:
>>> [Public]
>>>
>>>> -----Original Message-----
>>>> From: Lazar, Lijo <Lijo.Lazar@xxxxxxx>
>>>> Sent: Thursday, January 9, 2025 10:39 PM
>>>> To: Kim, Jonathan <Jonathan.Kim@xxxxxxx>; amd-gfx@xxxxxxxxxxxxxxxxxxxxx
>>>> Cc: Kasiviswanathan, Harish <Harish.Kasiviswanathan@xxxxxxx>
>>>> Subject: Re: [PATCH] drm/amdgpu: fix gpu recovery disable with per queue reset
>>>>
>>>>
>>>>
>>>> On 1/9/2025 8:27 PM, Kim, Jonathan wrote:
>>>>> [Public]
>>>>>
>>>>>> -----Original Message-----
>>>>>> From: Lazar, Lijo <Lijo.Lazar@xxxxxxx>
>>>>>> Sent: Thursday, January 9, 2025 1:14 AM
>>>>>> To: Kim, Jonathan <Jonathan.Kim@xxxxxxx>; amd-gfx@xxxxxxxxxxxxxxxxxxxxx
>>>>>> Cc: Kasiviswanathan, Harish <Harish.Kasiviswanathan@xxxxxxx>
>>>>>> Subject: Re: [PATCH] drm/amdgpu: fix gpu recovery disable with per queue
>> reset
>>>>>>
>>>>>>
>>>>>>
>>>>>> On 1/9/2025 1:31 AM, Jonathan Kim wrote:
>>>>>>> Per queue reset should be bypassed when gpu recovery is disabled
>>>>>>> with module parameter.
>>>>>>>
>>>>>>> Signed-off-by: Jonathan Kim <jonathan.kim@xxxxxxx>
>>>>>>> ---
>>>>>>>  drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gfx_v9.c | 6 ++++++
>>>>>>>  1 file changed, 6 insertions(+)
>>>>>>>
>>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gfx_v9.c
>>>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gfx_v9.c
>>>>>>> index cc66ebb7bae1..441568163e20 100644
>>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gfx_v9.c
>>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gfx_v9.c
>>>>>>> @@ -1131,6 +1131,9 @@ uint64_t kgd_gfx_v9_hqd_get_pq_addr(struct
>>>>>> amdgpu_device *adev,
>>>>>>>     uint32_t low, high;
>>>>>>>     uint64_t queue_addr = 0;
>>>>>>>
>>>>>>> +   if (!amdgpu_gpu_recovery)
>>>>>>> +           return 0;
>>>>>>> +
>>>>>>>     kgd_gfx_v9_acquire_queue(adev, pipe_id, queue_id, inst);
>>>>>>>     amdgpu_gfx_rlc_enter_safe_mode(adev, inst);
>>>>>>>
>>>>>>> @@ -1179,6 +1182,9 @@ uint64_t kgd_gfx_v9_hqd_reset(struct
>>>> amdgpu_device
>>>>>> *adev,
>>>>>>>     uint32_t low, high, pipe_reset_data = 0;
>>>>>>>     uint64_t queue_addr = 0;
>>>>>>>
>>>>>>> +   if (!amdgpu_gpu_recovery)
>>>>>>> +           return 0;
>>>>>>> +
>>>>>>
>>>>>> I think the right place for this check is not inside callback, should be
>>>>>> from the place where the callback gets called.
>>>>>
>>>>> I don't think it really matters.  We're going to have different reset types in the
>> future
>>>> that my come from different callers.
>>>>> It's probably easier to remember to put the bypass where the reset is actually
>>>> happening.
>>>>>
>>>>
>>>> If I want to define something like amdgpu_gpu_recovery=2 (don't do queue
>>>> reset but perform other resets), then it matters.
>>>
>>> I don't get why that matters.
>>> This callback alone, for example, calls 2 types of resets within itself (queue then
>> pipe).
>>> If you wanted partial resets between queue and pipe in this case, you'd have to
>> branch test within the callback itself.
>>> GPU reset bypass checks are invisible to the KFD section of the code as well.
>>>
>>>>
>>>> Since this is a callback, keeping it at the wrapper place may be more
>>>> maintainable rather than keeping the check for gfx10/11/12 etc.
>>>
>>> Maybe not.  MES is preemption checks are not like MEC preemption checks at all.
>>> And we probably don't want to jam other future IP resets into a single caller.
>>> If you look at the kfd2kgd callbacks, most are pretty much copy and paste from one
>> generation to another.
>>> I don't see how putting the test in the callback makes it less maintainable.
>>>
>>
>> My thought process was this could be put in - reset_queues_on_hws_hang
>> and anything similar handles MES based queue resets. What you are saying
>> there won't be anything like reset_queue_by_mes() for MES based resets.
>> Is that correct?
> 
> No the opposite.  But now we'd have to remember to put it in 2 places where there's still no visible test for gpu reset bypass in the same file.
> SDMA resets are also being implemented and will probably have to be called in different places in the KFD as well.
> We can look at consolidating this later as more devices and IPs get done if it makes sense to abstract this stuff.
> My point is, the callback does the reset and returns a reset status.
> Bypassing by fail return seems easier to remember and leverage.

Ok, we have SDMA queue reset called from job timeouts. If it's getting
called from KFD too, will look at consolidating that one.

BTW, if this is returning a fake success, won't it result in a print
like queue reset succeeded which gives the false impression that queue
reset happened? Or, should it return a different error code like
'ECANCELED' since operation is intentionally skipped through module param

Thanks,
Lijo

> That being said, putting the test in hqd_get_pq_addr was probably overkill, but I don't think anyone really cares to use it with gpu recovery off on its own at the moment.
> 
> Jon
> 
>>
>> Thanks,
>> Lijo
>>
>>> Jon
>>>
>>>>
>>>> Thanks,
>>>> Lijo
>>>>
>>>>> Jon
>>>>>
>>>>>>
>>>>>> Thanks,
>>>>>> Lijo
>>>>>>
>>>>>>>     kgd_gfx_v9_acquire_queue(adev, pipe_id, queue_id, inst);
>>>>>>>     amdgpu_gfx_rlc_enter_safe_mode(adev, inst);
>>>>>>>
>>>>>
>>>
>