Re: [PATCH] drm/amdgpu: fix gpu recovery disable with per queue reset

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 




On 1/15/2025 1:59 PM, Lazar, Lijo wrote:
> 
> 
> On 1/14/2025 10:51 PM, Kim, Jonathan wrote:
>> [Public]
>>
>>> -----Original Message-----
>>> From: Lazar, Lijo <Lijo.Lazar@xxxxxxx>
>>> Sent: Friday, January 10, 2025 10:37 PM
>>> To: Kim, Jonathan <Jonathan.Kim@xxxxxxx>; amd-gfx@xxxxxxxxxxxxxxxxxxxxx
>>> Cc: Kasiviswanathan, Harish <Harish.Kasiviswanathan@xxxxxxx>
>>> Subject: Re: [PATCH] drm/amdgpu: fix gpu recovery disable with per queue reset
>>>
>>>
>>>
>>> On 1/11/2025 2:53 AM, Kim, Jonathan wrote:
>>>> [Public]
>>>>
>>>>> -----Original Message-----
>>>>> From: Lazar, Lijo <Lijo.Lazar@xxxxxxx>
>>>>> Sent: Friday, January 10, 2025 11:29 AM
>>>>> To: Kim, Jonathan <Jonathan.Kim@xxxxxxx>; amd-gfx@xxxxxxxxxxxxxxxxxxxxx
>>>>> Cc: Kasiviswanathan, Harish <Harish.Kasiviswanathan@xxxxxxx>
>>>>> Subject: Re: [PATCH] drm/amdgpu: fix gpu recovery disable with per queue reset
>>>>>
>>>>>
>>>>>
>>>>> On 1/10/2025 9:43 PM, Kim, Jonathan wrote:
>>>>>> [Public]
>>>>>>
>>>>>>> -----Original Message-----
>>>>>>> From: Lazar, Lijo <Lijo.Lazar@xxxxxxx>
>>>>>>> Sent: Thursday, January 9, 2025 10:39 PM
>>>>>>> To: Kim, Jonathan <Jonathan.Kim@xxxxxxx>; amd-gfx@xxxxxxxxxxxxxxxxxxxxx
>>>>>>> Cc: Kasiviswanathan, Harish <Harish.Kasiviswanathan@xxxxxxx>
>>>>>>> Subject: Re: [PATCH] drm/amdgpu: fix gpu recovery disable with per queue
>>> reset
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On 1/9/2025 8:27 PM, Kim, Jonathan wrote:
>>>>>>>> [Public]
>>>>>>>>
>>>>>>>>> -----Original Message-----
>>>>>>>>> From: Lazar, Lijo <Lijo.Lazar@xxxxxxx>
>>>>>>>>> Sent: Thursday, January 9, 2025 1:14 AM
>>>>>>>>> To: Kim, Jonathan <Jonathan.Kim@xxxxxxx>; amd-
>>> gfx@xxxxxxxxxxxxxxxxxxxxx
>>>>>>>>> Cc: Kasiviswanathan, Harish <Harish.Kasiviswanathan@xxxxxxx>
>>>>>>>>> Subject: Re: [PATCH] drm/amdgpu: fix gpu recovery disable with per queue
>>>>> reset
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On 1/9/2025 1:31 AM, Jonathan Kim wrote:
>>>>>>>>>> Per queue reset should be bypassed when gpu recovery is disabled
>>>>>>>>>> with module parameter.
>>>>>>>>>>
>>>>>>>>>> Signed-off-by: Jonathan Kim <jonathan.kim@xxxxxxx>
>>>>>>>>>> ---
>>>>>>>>>>  drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gfx_v9.c | 6 ++++++
>>>>>>>>>>  1 file changed, 6 insertions(+)
>>>>>>>>>>
>>>>>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gfx_v9.c
>>>>>>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gfx_v9.c
>>>>>>>>>> index cc66ebb7bae1..441568163e20 100644
>>>>>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gfx_v9.c
>>>>>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gfx_v9.c
>>>>>>>>>> @@ -1131,6 +1131,9 @@ uint64_t kgd_gfx_v9_hqd_get_pq_addr(struct
>>>>>>>>> amdgpu_device *adev,
>>>>>>>>>>     uint32_t low, high;
>>>>>>>>>>     uint64_t queue_addr = 0;
>>>>>>>>>>
>>>>>>>>>> +   if (!amdgpu_gpu_recovery)
>>>>>>>>>> +           return 0;
>>>>>>>>>> +
>>>>>>>>>>     kgd_gfx_v9_acquire_queue(adev, pipe_id, queue_id, inst);
>>>>>>>>>>     amdgpu_gfx_rlc_enter_safe_mode(adev, inst);
>>>>>>>>>>
>>>>>>>>>> @@ -1179,6 +1182,9 @@ uint64_t kgd_gfx_v9_hqd_reset(struct
>>>>>>> amdgpu_device
>>>>>>>>> *adev,
>>>>>>>>>>     uint32_t low, high, pipe_reset_data = 0;
>>>>>>>>>>     uint64_t queue_addr = 0;
>>>>>>>>>>
>>>>>>>>>> +   if (!amdgpu_gpu_recovery)
>>>>>>>>>> +           return 0;
>>>>>>>>>> +
>>>>>>>>>
>>>>>>>>> I think the right place for this check is not inside callback, should be
>>>>>>>>> from the place where the callback gets called.
>>>>>>>>
>>>>>>>> I don't think it really matters.  We're going to have different reset types in the
>>>>> future
>>>>>>> that my come from different callers.
>>>>>>>> It's probably easier to remember to put the bypass where the reset is actually
>>>>>>> happening.
>>>>>>>>
>>>>>>>
>>>>>>> If I want to define something like amdgpu_gpu_recovery=2 (don't do queue
>>>>>>> reset but perform other resets), then it matters.
>>>>>>
>>>>>> I don't get why that matters.
>>>>>> This callback alone, for example, calls 2 types of resets within itself (queue then
>>>>> pipe).
>>>>>> If you wanted partial resets between queue and pipe in this case, you'd have to
>>>>> branch test within the callback itself.
>>>>>> GPU reset bypass checks are invisible to the KFD section of the code as well.
>>>>>>
>>>>>>>
>>>>>>> Since this is a callback, keeping it at the wrapper place may be more
>>>>>>> maintainable rather than keeping the check for gfx10/11/12 etc.
>>>>>>
>>>>>> Maybe not.  MES is preemption checks are not like MEC preemption checks at
>>> all.
>>>>>> And we probably don't want to jam other future IP resets into a single caller.
>>>>>> If you look at the kfd2kgd callbacks, most are pretty much copy and paste from
>>> one
>>>>> generation to another.
>>>>>> I don't see how putting the test in the callback makes it less maintainable.
>>>>>>
>>>>>
>>>>> My thought process was this could be put in - reset_queues_on_hws_hang
>>>>> and anything similar handles MES based queue resets. What you are saying
>>>>> there won't be anything like reset_queue_by_mes() for MES based resets.
>>>>> Is that correct?
>>>>
>>>> No the opposite.  But now we'd have to remember to put it in 2 places where
>>> there's still no visible test for gpu reset bypass in the same file.
>>>> SDMA resets are also being implemented and will probably have to be called in
>>> different places in the KFD as well.
>>>> We can look at consolidating this later as more devices and IPs get done if it
>>> makes sense to abstract this stuff.
>>>> My point is, the callback does the reset and returns a reset status.
>>>> Bypassing by fail return seems easier to remember and leverage.
>>>
>>> Ok, we have SDMA queue reset called from job timeouts. If it's getting
>>> called from KFD too, will look at consolidating that one.
>>>
>>> BTW, if this is returning a fake success, won't it result in a print
>>> like queue reset succeeded which gives the false impression that queue
>>> reset happened? Or, should it return a different error code like
>>> 'ECANCELED' since operation is intentionally skipped through module param
>>
>> Well ... the call is supposed to return an address of which queue got reset where a null return indicates "no queue got reset".
> 
> We should also somehow indicate to the user that the queue indeed ran
> into a reset situation. Not sure if that is taken care if address is
> returned as NULL.
> 
>> I'm thinking to make this simpler, maybe we change reset_queues_on_hws into a wrapper that takes in a queue type input (compute, sdma etc) and branches to the right reset call based on input type.
>> That way we only need 1 place to do the gpu_recovery enablement check in the KFD, and the KFD has the flexibility to call this wrapper where ever it wants to without having to worry about the module parameter status in the future.
> 
> Yes, this was the intention of original comment - to add at caller place
> rather than inside callback. It provides a one-place (or fewer places)
> control of the usage.
> 
> While at it, suggest to use amdgpu_device_should_recover_gpu(). This
> will give an info message if recovery is disabled, and we could control
> different meanings of this param (Ex: gpu_recovery = 2, avoid sdma queue
> reset alone) through the same function.
> 

As an afterthought, realized that this function doesn't have a param for
the type of recovery attempted. This may be added later if at all we
have a situation to assign new param values like skip reset for sdma
alone. Would still recommend to use the same function instead of direct
value check of param.

Thanks,
Lijo

> Thanks,
> Lijo
> 
>>
>> Jon
>>
>>>
>>> Thanks,
>>> Lijo
>>>
>>>> That being said, putting the test in hqd_get_pq_addr was probably overkill, but I
>>> don't think anyone really cares to use it with gpu recovery off on its own at the
>>> moment.
>>>>
>>>> Jon
>>>>
>>>>>
>>>>> Thanks,
>>>>> Lijo
>>>>>
>>>>>> Jon
>>>>>>
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Lijo
>>>>>>>
>>>>>>>> Jon
>>>>>>>>
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>> Lijo
>>>>>>>>>
>>>>>>>>>>     kgd_gfx_v9_acquire_queue(adev, pipe_id, queue_id, inst);
>>>>>>>>>>     amdgpu_gfx_rlc_enter_safe_mode(adev, inst);
>>>>>>>>>>
>>>>>>>>
>>>>>>
>>>>
>>
> 




[Index of Archives]     [Linux USB Devel]     [Linux Audio Users]     [Yosemite News]     [Linux Kernel]     [Linux SCSI]

  Powered by Linux