RE: [PATCH] drm/amdgpu: fix gpu recovery disable with per queue reset

"Kim, Jonathan" <Jonathan.Kim@xxxxxxx> · Fri, 10 Jan 2025 16:13:10 +0000

[Public]

> -----Original Message-----
> From: Lazar, Lijo <Lijo.Lazar@xxxxxxx>
> Sent: Thursday, January 9, 2025 10:39 PM
> To: Kim, Jonathan <Jonathan.Kim@xxxxxxx>; amd-gfx@xxxxxxxxxxxxxxxxxxxxx
> Cc: Kasiviswanathan, Harish <Harish.Kasiviswanathan@xxxxxxx>
> Subject: Re: [PATCH] drm/amdgpu: fix gpu recovery disable with per queue reset
>
>
>
> On 1/9/2025 8:27 PM, Kim, Jonathan wrote:
> > [Public]
> >
> >> -----Original Message-----
> >> From: Lazar, Lijo <Lijo.Lazar@xxxxxxx>
> >> Sent: Thursday, January 9, 2025 1:14 AM
> >> To: Kim, Jonathan <Jonathan.Kim@xxxxxxx>; amd-gfx@xxxxxxxxxxxxxxxxxxxxx
> >> Cc: Kasiviswanathan, Harish <Harish.Kasiviswanathan@xxxxxxx>
> >> Subject: Re: [PATCH] drm/amdgpu: fix gpu recovery disable with per queue reset
> >>
> >>
> >>
> >> On 1/9/2025 1:31 AM, Jonathan Kim wrote:
> >>> Per queue reset should be bypassed when gpu recovery is disabled
> >>> with module parameter.
> >>>
> >>> Signed-off-by: Jonathan Kim <jonathan.kim@xxxxxxx>
> >>> ---
> >>>  drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gfx_v9.c | 6 ++++++
> >>>  1 file changed, 6 insertions(+)
> >>>
> >>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gfx_v9.c
> >> b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gfx_v9.c
> >>> index cc66ebb7bae1..441568163e20 100644
> >>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gfx_v9.c
> >>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gfx_v9.c
> >>> @@ -1131,6 +1131,9 @@ uint64_t kgd_gfx_v9_hqd_get_pq_addr(struct
> >> amdgpu_device *adev,
> >>>     uint32_t low, high;
> >>>     uint64_t queue_addr = 0;
> >>>
> >>> +   if (!amdgpu_gpu_recovery)
> >>> +           return 0;
> >>> +
> >>>     kgd_gfx_v9_acquire_queue(adev, pipe_id, queue_id, inst);
> >>>     amdgpu_gfx_rlc_enter_safe_mode(adev, inst);
> >>>
> >>> @@ -1179,6 +1182,9 @@ uint64_t kgd_gfx_v9_hqd_reset(struct
> amdgpu_device
> >> *adev,
> >>>     uint32_t low, high, pipe_reset_data = 0;
> >>>     uint64_t queue_addr = 0;
> >>>
> >>> +   if (!amdgpu_gpu_recovery)
> >>> +           return 0;
> >>> +
> >>
> >> I think the right place for this check is not inside callback, should be
> >> from the place where the callback gets called.
> >
> > I don't think it really matters.  We're going to have different reset types in the future
> that my come from different callers.
> > It's probably easier to remember to put the bypass where the reset is actually
> happening.
> >
>
> If I want to define something like amdgpu_gpu_recovery=2 (don't do queue
> reset but perform other resets), then it matters.

I don't get why that matters.
This callback alone, for example, calls 2 types of resets within itself (queue then pipe).
If you wanted partial resets between queue and pipe in this case, you'd have to branch test within the callback itself.
GPU reset bypass checks are invisible to the KFD section of the code as well.

>
> Since this is a callback, keeping it at the wrapper place may be more
> maintainable rather than keeping the check for gfx10/11/12 etc.

Maybe not.  MES is preemption checks are not like MEC preemption checks at all.
And we probably don't want to jam other future IP resets into a single caller.
If you look at the kfd2kgd callbacks, most are pretty much copy and paste from one generation to another.
I don't see how putting the test in the callback makes it less maintainable.

Jon

>
> Thanks,
> Lijo
>
> > Jon
> >
> >>
> >> Thanks,
> >> Lijo
> >>
> >>>     kgd_gfx_v9_acquire_queue(adev, pipe_id, queue_id, inst);
> >>>     amdgpu_gfx_rlc_enter_safe_mode(adev, inst);
> >>>
> >