[AMD Official Use Only]

To explain more - it's an unconditional reset done by the kernel on every suspend (S3/S4). In such a case, which process is going to receive the trace events?

The most likely use case would be related to GPU recovery. Triggering a trace on every reset doesn't look like a good idea.

Thanks,
Lijo

-----Original Message-----
From: Sharma, Shashank <Shashank.Sharma@xxxxxxx>
Sent: Friday, February 4, 2022 10:09 PM
To: Lazar, Lijo <Lijo.Lazar@xxxxxxx>; amd-gfx@xxxxxxxxxxxxxxxxxxxxx
Cc: Deucher, Alexander <Alexander.Deucher@xxxxxxx>; Somalapuram, Amaranath <Amaranath.Somalapuram@xxxxxxx>; Koenig, Christian <Christian.Koenig@xxxxxxx>
Subject: Re: [PATCH 4/4] drm/amdgpu/nv: add navi GPU reset handler

Hey Lijo,

I somehow missed responding to this comment, please find my reply inline.

Regards
Shashank

On 1/22/2022 7:42 AM, Lazar, Lijo wrote:
>
>
> On 1/22/2022 2:04 AM, Sharma, Shashank wrote:
>> From 899ec6060eb7d8a3d4d56ab439e4e6cdd74190a4 Mon Sep 17 00:00:00 2001
>> From: Somalapuram Amaranath <Amaranath.Somalapuram@xxxxxxx>
>> Date: Fri, 21 Jan 2022 14:19:42 +0530
>> Subject: [PATCH 4/4] drm/amdgpu/nv: add navi GPU reset handler
>>
>> This patch adds a GPU reset handler for the Navi ASIC family, which
>> typically dumps some of the registers and sends a trace event.
>>
>> V2: Accommodated call to work function to send uevent
>>
>> Signed-off-by: Somalapuram Amaranath <Amaranath.Somalapuram@xxxxxxx>
>> Signed-off-by: Shashank Sharma <shashank.sharma@xxxxxxx>
>> ---
>>  drivers/gpu/drm/amd/amdgpu/nv.c | 28 ++++++++++++++++++++++++++++
>>  1 file changed, 28 insertions(+)
>>
>> diff --git a/drivers/gpu/drm/amd/amdgpu/nv.c b/drivers/gpu/drm/amd/amdgpu/nv.c
>> index 01efda4398e5..ada35d4c5245 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/nv.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/nv.c
>> @@ -528,10 +528,38 @@ nv_asic_reset_method(struct amdgpu_device *adev)
>>      }
>>  }
>>
>> +static void amdgpu_reset_dumps(struct amdgpu_device *adev)
>> +{
>> +    int r = 0, i;
>> +
>> +    /* original raven doesn't have full asic reset */
>> +    if ((adev->apu_flags & AMD_APU_IS_RAVEN) &&
>> +        !(adev->apu_flags & AMD_APU_IS_RAVEN2))
>> +        return;
>> +    for (i = 0; i < adev->num_ip_blocks; i++) {
>> +        if (!adev->ip_blocks[i].status.valid)
>> +            continue;
>> +        if (!adev->ip_blocks[i].version->funcs->reset_reg_dumps)
>> +            continue;
>> +        r = adev->ip_blocks[i].version->funcs->reset_reg_dumps(adev);
>> +        if (r)
>> +            DRM_ERROR("reset_reg_dumps of IP block <%s> failed %d\n",
>> +                adev->ip_blocks[i].version->funcs->name, r);
>> +    }
>> +
>> +    /* Schedule work to send uevent */
>> +    if (!queue_work(system_unbound_wq, &adev->gpu_reset_work))
>> +        DRM_ERROR("failed to add GPU reset work\n");
>> +
>> +    dump_stack();
>> +}
>> +
>>  static int nv_asic_reset(struct amdgpu_device *adev)
>>  {
>>      int ret = 0;
>>
>> +    amdgpu_reset_dumps(adev);
>
> Had a comment on this before. Now there are different reasons (or even
> no reason, like a precautionary reset) to perform a reset. A user would
> be interested in a trace only if the reason is valid.
>
> To clarify why a work shouldn't be scheduled on every reset, check here -
>
> https://elixir.bootlin.com/linux/latest/source/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c#L2188

In the example you pointed to, they have a criterion to decide what is a valid reset in their context, on the kernel side itself. So they can take a call on whether they want to do something about it or not.
But, in our case, we want to send the trace_event to user with some register values on every reset, and it is actually up to the profiling app to interpret (along with what it wants to call a GPU reset). So I don't think this is causing a considerable overhead.

- Shashank

>
>
>
> Thanks,
> Lijo
>
>>      switch (nv_asic_reset_method(adev)) {
>>      case AMD_RESET_METHOD_PCI:
>>          dev_info(adev->dev, "PCI reset\n");
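
For illustration only, and not part of the posted patch: one way to make Lijo's suggestion concrete would be to gate the dump-and-uevent call on why the reset happened, so the unconditional suspend-time reset stays silent while recovery resets still emit the trace. The enum and the wrapper below are hypothetical names invented for this sketch; they do not exist in amdgpu.

/* Hypothetical sketch: gate register dumps and the uevent on the reset
 * reason.  Neither the enum nor the wrapper exist in the posted patch;
 * they only illustrate the suggestion being discussed.
 */
enum nv_reset_reason {
	NV_RESET_SUSPEND,	/* unconditional reset on S3/S4 suspend */
	NV_RESET_RECOVERY,	/* GPU hang / job-timeout recovery */
};

static void amdgpu_reset_dumps_for_reason(struct amdgpu_device *adev,
					   enum nv_reset_reason reason)
{
	/* Precautionary resets (e.g. the suspend path) stay silent */
	if (reason != NV_RESET_RECOVERY)
		return;

	/* Dumps registers and schedules the uevent work, as in the patch */
	amdgpu_reset_dumps(adev);
}

Whether such a reason would be plumbed down from the reset caller or derived from a shared reset context is left open by the thread; Shashank's position above is that the filtering belongs in the profiling app instead.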