[AMD Official Use Only]

No, otherwise a driver reset is done only for GPU recovery purposes. S3/S4 is not meant for recovery; it's just a precautionary reset to make sure that everything works fine on resume.

BTW, since this is already providing a set of values, it would be useful to provide one more field as the reset reason - RAS error recovery, GPU hung recovery or something else.

Thanks,
Lijo
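A rough sketch of how such a reason field could sit next to the register dump in the trace event (illustrative only; the event name, enum values and string-dump form below are assumptions, not taken from the posted patches, and the usual trace-header boilerplate is omitted):

/* Illustrative sketch only, not from the posted patches. */
enum amdgpu_reset_reason {
	AMDGPU_RESET_REASON_GPU_HANG,	/* GPU hung recovery */
	AMDGPU_RESET_REASON_RAS_ERROR,	/* RAS error recovery */
	AMDGPU_RESET_REASON_SUSPEND,	/* precautionary S3/S4 reset */
};

TRACE_EVENT(amdgpu_reset_reg_dumps,
	TP_PROTO(const char *reg_dumps, int reason),
	TP_ARGS(reg_dumps, reason),
	TP_STRUCT__entry(
		__string(dumps, reg_dumps)
		__field(int, reason)
	),
	TP_fast_assign(
		__assign_str(dumps, reg_dumps);
		__entry->reason = reason;
	),
	TP_printk("reason=%s reg_dumps=%s",
		  __print_symbolic(__entry->reason,
				   { AMDGPU_RESET_REASON_GPU_HANG, "gpu_hang" },
				   { AMDGPU_RESET_REASON_RAS_ERROR, "ras_error" },
				   { AMDGPU_RESET_REASON_SUSPEND, "suspend" }),
		  __get_str(dumps))
);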
-----Original Message-----
From: Sharma, Shashank <Shashank.Sharma@xxxxxxx>
Sent: Friday, February 4, 2022 10:37 PM
To: Lazar, Lijo <Lijo.Lazar@xxxxxxx>; amd-gfx@xxxxxxxxxxxxxxxxxxxxx
Cc: Deucher, Alexander <Alexander.Deucher@xxxxxxx>; Somalapuram, Amaranath <Amaranath.Somalapuram@xxxxxxx>; Koenig, Christian <Christian.Koenig@xxxxxxx>
Subject: Re: [PATCH 4/4] drm/amdgpu/nv: add navi GPU reset handler

On 2/4/2022 6:02 PM, Lazar, Lijo wrote:
> [Public]
>
> The problem is that the app doesn't know why the reset happened. It just receives a bunch of registers to be read. On what basis can an app filter this out?
>

Again, that is contextual analysis capability, which needs to be embedded in the reader app. Even if we filter out the S3/S4 resets in the kernel, the situation remains the same, isn't it?

- Shashank

> Thanks,
> Lijo
>
> -----Original Message-----
> From: Sharma, Shashank <Shashank.Sharma@xxxxxxx>
> Sent: Friday, February 4, 2022 10:29 PM
> To: Lazar, Lijo <Lijo.Lazar@xxxxxxx>; amd-gfx@xxxxxxxxxxxxxxxxxxxxx
> Cc: Deucher, Alexander <Alexander.Deucher@xxxxxxx>; Somalapuram, Amaranath <Amaranath.Somalapuram@xxxxxxx>; Koenig, Christian <Christian.Koenig@xxxxxxx>
> Subject: Re: [PATCH 4/4] drm/amdgpu/nv: add navi GPU reset handler
>
> On 2/4/2022 5:50 PM, Lazar, Lijo wrote:
>> [AMD Official Use Only]
>>
>> To explain more -
>> It's an unconditional reset done by the kernel on every suspend (S3/S4). In such a case, which process is going to receive the trace events?
>>
>> The most likely use case would be related to GPU recovery. Triggering a trace on every reset doesn't look like a good idea.
>>
>
> If you observe carefully, we are just providing an infrastructure; the application's intention is unknown to us. In my opinion it's rather not a good idea to apply a filter in the kernel, with our interpretation of that intention.
>
> For example, if an app just wants to count how many resets are happening due to S3/S4 transitions, this infra might become useless. It would rather be a better idea for the app to learn about, and ignore, the scenarios it is not interested in.
>
> This could eventually just be a difference in design philosophy, maybe :)
>
> - Shashank
>
>> Thanks,
>> Lijo
>>
>> -----Original Message-----
>> From: Sharma, Shashank <Shashank.Sharma@xxxxxxx>
>> Sent: Friday, February 4, 2022 10:09 PM
>> To: Lazar, Lijo <Lijo.Lazar@xxxxxxx>; amd-gfx@xxxxxxxxxxxxxxxxxxxxx
>> Cc: Deucher, Alexander <Alexander.Deucher@xxxxxxx>; Somalapuram, Amaranath <Amaranath.Somalapuram@xxxxxxx>; Koenig, Christian <Christian.Koenig@xxxxxxx>
>> Subject: Re: [PATCH 4/4] drm/amdgpu/nv: add navi GPU reset handler
>>
>> Hey Lijo,
>> I somehow missed responding to this comment, please find my reply inline:
>>
>> Regards
>> Shashank
>>
>> On 1/22/2022 7:42 AM, Lazar, Lijo wrote:
>>>
>>> On 1/22/2022 2:04 AM, Sharma, Shashank wrote:
>>>> From 899ec6060eb7d8a3d4d56ab439e4e6cdd74190a4 Mon Sep 17 00:00:00 2001
>>>> From: Somalapuram Amaranath <Amaranath.Somalapuram@xxxxxxx>
>>>> Date: Fri, 21 Jan 2022 14:19:42 +0530
>>>> Subject: [PATCH 4/4] drm/amdgpu/nv: add navi GPU reset handler
>>>>
>>>> This patch adds a GPU reset handler for the Navi ASIC family, which
>>>> typically dumps some of the registers and sends a trace event.
>>>>
>>>> V2: Accommodated call to work function to send uevent
>>>>
>>>> Signed-off-by: Somalapuram Amaranath <Amaranath.Somalapuram@xxxxxxx>
>>>> Signed-off-by: Shashank Sharma <shashank.sharma@xxxxxxx>
>>>> ---
>>>>  drivers/gpu/drm/amd/amdgpu/nv.c | 28 ++++++++++++++++++++++++++++
>>>>  1 file changed, 28 insertions(+)
>>>>
>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/nv.c b/drivers/gpu/drm/amd/amdgpu/nv.c
>>>> index 01efda4398e5..ada35d4c5245 100644
>>>> --- a/drivers/gpu/drm/amd/amdgpu/nv.c
>>>> +++ b/drivers/gpu/drm/amd/amdgpu/nv.c
>>>> @@ -528,10 +528,38 @@ nv_asic_reset_method(struct amdgpu_device *adev)
>>>>      }
>>>>  }
>>>>
>>>> +static void amdgpu_reset_dumps(struct amdgpu_device *adev) {
>>>> +    int r = 0, i;
>>>> +
>>>> +    /* original raven doesn't have full asic reset */
>>>> +    if ((adev->apu_flags & AMD_APU_IS_RAVEN) &&
>>>> +        !(adev->apu_flags & AMD_APU_IS_RAVEN2))
>>>> +        return;
>>>> +    for (i = 0; i < adev->num_ip_blocks; i++) {
>>>> +        if (!adev->ip_blocks[i].status.valid)
>>>> +            continue;
>>>> +        if (!adev->ip_blocks[i].version->funcs->reset_reg_dumps)
>>>> +            continue;
>>>> +        r = adev->ip_blocks[i].version->funcs->reset_reg_dumps(adev);
>>>> +
>>>> +        if (r)
>>>> +            DRM_ERROR("reset_reg_dumps of IP block <%s> failed %d\n",
>>>> +                adev->ip_blocks[i].version->funcs->name, r);
>>>> +    }
>>>> +
>>>> +    /* Schedule work to send uevent */
>>>> +    if (!queue_work(system_unbound_wq, &adev->gpu_reset_work))
>>>> +        DRM_ERROR("failed to add GPU reset work\n");
>>>> +
>>>> +    dump_stack();
>>>> +}
>>>> +
>>>>  static int nv_asic_reset(struct amdgpu_device *adev)
>>>>  {
>>>>      int ret = 0;
>>>>
>>>> +    amdgpu_reset_dumps(adev);
>>>
>>> Had a comment on this before. Now there are different reasons (or
>>> even no reason, like a precautionary reset) to perform a reset. A user
>>> would be interested in a trace only if the reason is valid.
>>>
>>> To clarify why a work shouldn't be scheduled on every reset,
>>> check here -
>>>
>>> https://elixir.bootlin.com/linux/latest/source/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c#L2188
>>
>> In the example you pointed to, they have a criterion to decide what is a valid reset in their context, on the kernel side itself. So they can take a call on whether they want to do something about it or not.
>>
>> But, in our case, we want to send the trace_event to user with some register values on every reset, and it is actually up to the profiling app to interpret (along with what it wants to call a GPU reset). So I don't think this is causing a considerable overhead.
>>
>> - Shashank
>>>
>>> Thanks,
>>> Lijo
>>>
>>>>      switch (nv_asic_reset_method(adev)) {
>>>>      case AMD_RESET_METHOD_PCI:
>>>>          dev_info(adev->dev, "PCI reset\n");
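To illustrate the "let the app filter" argument, a minimal userspace sketch of a reader app doing its own interpretation of the reset traces (assumptions: the trace event is named amdgpu_reset_reg_dumps and prints a "reason=" field as in the sketch earlier in this thread; neither is guaranteed by the posted patches):

#include <stdio.h>
#include <string.h>

int main(void)
{
	/* Needs debugfs mounted and enough privileges to read the trace pipe. */
	FILE *fp = fopen("/sys/kernel/debug/tracing/trace_pipe", "r");
	char line[4096];
	unsigned long recoveries = 0, suspend_resets = 0;

	if (!fp) {
		perror("trace_pipe");
		return 1;
	}

	while (fgets(line, sizeof(line), fp)) {
		if (!strstr(line, "amdgpu_reset_reg_dumps"))
			continue;
		/* The app, not the kernel, decides that S3/S4 resets are uninteresting. */
		if (strstr(line, "reason=suspend")) {
			suspend_resets++;
			continue;
		}
		recoveries++;
		printf("GPU recovery #%lu (ignored suspend resets: %lu)\n%s",
		       recoveries, suspend_resets, line);
	}

	fclose(fp);
	return 0;
}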