[AMD Official Use Only - AMD Internal Distribution Only] We need to perform gpu reset for HW and only make the reset flow failing from driver perspective. Tao > -----Original Message----- > From: Lazar, Lijo <Lijo.Lazar@xxxxxxx> > Sent: Thursday, August 1, 2024 2:41 PM > To: Zhou1, Tao <Tao.Zhou1@xxxxxxx>; Zhang, Hawking > <Hawking.Zhang@xxxxxxx>; amd-gfx@xxxxxxxxxxxxxxxxxxxxx > Subject: Re: [PATCH] drm/amdgpu: report bad status in GPU recovery > > > > On 8/1/2024 11:28 AM, Zhou1, Tao wrote: > > [AMD Official Use Only - AMD Internal Distribution Only] > > > > [AMD Official Use Only - AMD Internal Distribution Only] > > > > Yes, the bad status message is printed twice with this patch. I think it's harmless > and the second message is more convenient for customer. > > > > I can add a parameter for amdgpu_ras_eeprom_check_err_threshold to disable > the first message if you think printing message twice is not a good idea. > > > > Instead of this way, can't this be added to amdgpu_ras_do_recovery() and stop all > recovery actions? > > Thanks, > Lijo > > > Tao > > > >> -----Original Message----- > >> From: Zhang, Hawking <Hawking.Zhang@xxxxxxx> > >> Sent: Thursday, August 1, 2024 1:30 PM > >> To: Zhou1, Tao <Tao.Zhou1@xxxxxxx>; amd-gfx@xxxxxxxxxxxxxxxxxxxxx > >> Subject: RE: [PATCH] drm/amdgpu: report bad status in GPU recovery > >> > >> [AMD Official Use Only - AMD Internal Distribution Only] > >> > >> Right, it's functional. My concern is whether the kernel message in > >> amdgpu_ras_eeprom_check_err_threshold will be printed twice. This is > >> the end of gpu recovery (i.e., report gpu reset failed or gpu reset succeed). > >> Check_err_threshold was already done before reaching here. > >> > >> Regards, > >> Hawking > >> > >> -----Original Message----- > >> From: Zhou1, Tao <Tao.Zhou1@xxxxxxx> > >> Sent: Thursday, August 1, 2024 11:49 > >> To: Zhang, Hawking <Hawking.Zhang@xxxxxxx>; > >> amd-gfx@xxxxxxxxxxxxxxxxxxxxx > >> Subject: RE: [PATCH] drm/amdgpu: report bad status in GPU recovery > >> > >> [AMD Official Use Only - AMD Internal Distribution Only] > >> > >> I think the if condition in amdgpu_ras_eeprom_check_err_threshold is > >> good enough, no need to update it with is_rma. > >> > >> Tao > >> > >>> -----Original Message----- > >>> From: Zhang, Hawking <Hawking.Zhang@xxxxxxx> > >>> Sent: Thursday, August 1, 2024 11:00 AM > >>> To: Zhou1, Tao <Tao.Zhou1@xxxxxxx>; amd-gfx@xxxxxxxxxxxxxxxxxxxxx > >>> Cc: Zhou1, Tao <Tao.Zhou1@xxxxxxx> > >>> Subject: RE: [PATCH] drm/amdgpu: report bad status in GPU recovery > >>> > >>> [AMD Official Use Only - AMD Internal Distribution Only] > >>> > >>> Might consider leverage is_RMA flag for the same purpose? > >>> > >>> Regards, > >>> Hawking > >>> > >>> -----Original Message----- > >>> From: amd-gfx <amd-gfx-bounces@xxxxxxxxxxxxxxxxxxxxx> On Behalf Of > >>> Tao Zhou > >>> Sent: Wednesday, July 31, 2024 18:05 > >>> To: amd-gfx@xxxxxxxxxxxxxxxxxxxxx > >>> Cc: Zhou1, Tao <Tao.Zhou1@xxxxxxx> > >>> Subject: [PATCH] drm/amdgpu: report bad status in GPU recovery > >>> > >>> Instead of printing GPU reset failed. > >>> > >>> Signed-off-by: Tao Zhou <tao.zhou1@xxxxxxx> > >>> --- > >>> drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 9 +++++++-- > >>> 1 file changed, 7 insertions(+), 2 deletions(-) > >>> > >>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c > >>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c > >>> index 355c2478c4b6..b7c967779b4b 100644 > >>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c > >>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c > >>> @@ -5933,8 +5933,13 @@ int amdgpu_device_gpu_recover(struct > >>> amdgpu_device *adev, > >>> tmp_adev->asic_reset_res = 0; > >>> > >>> if (r) { > >>> - /* bad news, how to tell it to userspace ? */ > >>> - dev_info(tmp_adev->dev, "GPU reset(%d) failed\n", > >>> atomic_read(&tmp_adev->gpu_reset_counter)); > >>> + /* bad news, how to tell it to userspace ? > >>> + * for ras error, we should report GPU bad status instead of > >>> + * reset failure > >>> + */ > >>> + if (!amdgpu_ras_eeprom_check_err_threshold(tmp_adev)) > >>> + dev_info(tmp_adev->dev, "GPU > >>> + reset(%d) failed\n", > >>> + > >>> + atomic_read(&tmp_adev->gpu_reset_counter)); > >>> amdgpu_vf_error_put(tmp_adev, > >>> AMDGIM_ERROR_VF_GPU_RESET_FAIL, 0, r); > >>> } else { > >>> dev_info(tmp_adev->dev, "GPU reset(%d) > >>> succeeded!\n", atomic_read(&tmp_adev->gpu_reset_counter)); > >>> -- > >>> 2.34.1 > >>> > >> > >> > >