On 1/10/2025 9:43 PM, Kim, Jonathan wrote: > [Public] > >> -----Original Message----- >> From: Lazar, Lijo <Lijo.Lazar@xxxxxxx> >> Sent: Thursday, January 9, 2025 10:39 PM >> To: Kim, Jonathan <Jonathan.Kim@xxxxxxx>; amd-gfx@xxxxxxxxxxxxxxxxxxxxx >> Cc: Kasiviswanathan, Harish <Harish.Kasiviswanathan@xxxxxxx> >> Subject: Re: [PATCH] drm/amdgpu: fix gpu recovery disable with per queue reset >> >> >> >> On 1/9/2025 8:27 PM, Kim, Jonathan wrote: >>> [Public] >>> >>>> -----Original Message----- >>>> From: Lazar, Lijo <Lijo.Lazar@xxxxxxx> >>>> Sent: Thursday, January 9, 2025 1:14 AM >>>> To: Kim, Jonathan <Jonathan.Kim@xxxxxxx>; amd-gfx@xxxxxxxxxxxxxxxxxxxxx >>>> Cc: Kasiviswanathan, Harish <Harish.Kasiviswanathan@xxxxxxx> >>>> Subject: Re: [PATCH] drm/amdgpu: fix gpu recovery disable with per queue reset >>>> >>>> >>>> >>>> On 1/9/2025 1:31 AM, Jonathan Kim wrote: >>>>> Per queue reset should be bypassed when gpu recovery is disabled >>>>> with module parameter. >>>>> >>>>> Signed-off-by: Jonathan Kim <jonathan.kim@xxxxxxx> >>>>> --- >>>>> drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gfx_v9.c | 6 ++++++ >>>>> 1 file changed, 6 insertions(+) >>>>> >>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gfx_v9.c >>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gfx_v9.c >>>>> index cc66ebb7bae1..441568163e20 100644 >>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gfx_v9.c >>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gfx_v9.c >>>>> @@ -1131,6 +1131,9 @@ uint64_t kgd_gfx_v9_hqd_get_pq_addr(struct >>>> amdgpu_device *adev, >>>>> uint32_t low, high; >>>>> uint64_t queue_addr = 0; >>>>> >>>>> + if (!amdgpu_gpu_recovery) >>>>> + return 0; >>>>> + >>>>> kgd_gfx_v9_acquire_queue(adev, pipe_id, queue_id, inst); >>>>> amdgpu_gfx_rlc_enter_safe_mode(adev, inst); >>>>> >>>>> @@ -1179,6 +1182,9 @@ uint64_t kgd_gfx_v9_hqd_reset(struct >> amdgpu_device >>>> *adev, >>>>> uint32_t low, high, pipe_reset_data = 0; >>>>> uint64_t queue_addr = 0; >>>>> >>>>> + if (!amdgpu_gpu_recovery) >>>>> + return 0; >>>>> + >>>> >>>> I think the right place for this check is not inside callback, should be >>>> from the place where the callback gets called. >>> >>> I don't think it really matters. We're going to have different reset types in the future >> that my come from different callers. >>> It's probably easier to remember to put the bypass where the reset is actually >> happening. >>> >> >> If I want to define something like amdgpu_gpu_recovery=2 (don't do queue >> reset but perform other resets), then it matters. > > I don't get why that matters. > This callback alone, for example, calls 2 types of resets within itself (queue then pipe). > If you wanted partial resets between queue and pipe in this case, you'd have to branch test within the callback itself. > GPU reset bypass checks are invisible to the KFD section of the code as well. > >> >> Since this is a callback, keeping it at the wrapper place may be more >> maintainable rather than keeping the check for gfx10/11/12 etc. > > Maybe not. MES is preemption checks are not like MEC preemption checks at all. > And we probably don't want to jam other future IP resets into a single caller. > If you look at the kfd2kgd callbacks, most are pretty much copy and paste from one generation to another. > I don't see how putting the test in the callback makes it less maintainable. > My thought process was this could be put in - reset_queues_on_hws_hang and anything similar handles MES based queue resets. What you are saying there won't be anything like reset_queue_by_mes() for MES based resets. Is that correct? Thanks, Lijo > Jon > >> >> Thanks, >> Lijo >> >>> Jon >>> >>>> >>>> Thanks, >>>> Lijo >>>> >>>>> kgd_gfx_v9_acquire_queue(adev, pipe_id, queue_id, inst); >>>>> amdgpu_gfx_rlc_enter_safe_mode(adev, inst); >>>>> >>> >