[AMD Official Use Only - AMD Internal Distribution Only] Yes, the bad status message is printed twice with this patch. I think it's harmless and the second message is more convenient for customer. I can add a parameter for amdgpu_ras_eeprom_check_err_threshold to disable the first message if you think printing message twice is not a good idea. Tao > -----Original Message----- > From: Zhang, Hawking <Hawking.Zhang@xxxxxxx> > Sent: Thursday, August 1, 2024 1:30 PM > To: Zhou1, Tao <Tao.Zhou1@xxxxxxx>; amd-gfx@xxxxxxxxxxxxxxxxxxxxx > Subject: RE: [PATCH] drm/amdgpu: report bad status in GPU recovery > > [AMD Official Use Only - AMD Internal Distribution Only] > > Right, it's functional. My concern is whether the kernel message in > amdgpu_ras_eeprom_check_err_threshold will be printed twice. This is the end > of gpu recovery (i.e., report gpu reset failed or gpu reset succeed). > Check_err_threshold was already done before reaching here. > > Regards, > Hawking > > -----Original Message----- > From: Zhou1, Tao <Tao.Zhou1@xxxxxxx> > Sent: Thursday, August 1, 2024 11:49 > To: Zhang, Hawking <Hawking.Zhang@xxxxxxx>; amd-gfx@xxxxxxxxxxxxxxxxxxxxx > Subject: RE: [PATCH] drm/amdgpu: report bad status in GPU recovery > > [AMD Official Use Only - AMD Internal Distribution Only] > > I think the if condition in amdgpu_ras_eeprom_check_err_threshold is good > enough, no need to update it with is_rma. > > Tao > > > -----Original Message----- > > From: Zhang, Hawking <Hawking.Zhang@xxxxxxx> > > Sent: Thursday, August 1, 2024 11:00 AM > > To: Zhou1, Tao <Tao.Zhou1@xxxxxxx>; amd-gfx@xxxxxxxxxxxxxxxxxxxxx > > Cc: Zhou1, Tao <Tao.Zhou1@xxxxxxx> > > Subject: RE: [PATCH] drm/amdgpu: report bad status in GPU recovery > > > > [AMD Official Use Only - AMD Internal Distribution Only] > > > > Might consider leverage is_RMA flag for the same purpose? > > > > Regards, > > Hawking > > > > -----Original Message----- > > From: amd-gfx <amd-gfx-bounces@xxxxxxxxxxxxxxxxxxxxx> On Behalf Of Tao > > Zhou > > Sent: Wednesday, July 31, 2024 18:05 > > To: amd-gfx@xxxxxxxxxxxxxxxxxxxxx > > Cc: Zhou1, Tao <Tao.Zhou1@xxxxxxx> > > Subject: [PATCH] drm/amdgpu: report bad status in GPU recovery > > > > Instead of printing GPU reset failed. > > > > Signed-off-by: Tao Zhou <tao.zhou1@xxxxxxx> > > --- > > drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 9 +++++++-- > > 1 file changed, 7 insertions(+), 2 deletions(-) > > > > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c > > b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c > > index 355c2478c4b6..b7c967779b4b 100644 > > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c > > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c > > @@ -5933,8 +5933,13 @@ int amdgpu_device_gpu_recover(struct > > amdgpu_device *adev, > > tmp_adev->asic_reset_res = 0; > > > > if (r) { > > - /* bad news, how to tell it to userspace ? */ > > - dev_info(tmp_adev->dev, "GPU reset(%d) failed\n", > > atomic_read(&tmp_adev->gpu_reset_counter)); > > + /* bad news, how to tell it to userspace ? > > + * for ras error, we should report GPU bad status instead of > > + * reset failure > > + */ > > + if (!amdgpu_ras_eeprom_check_err_threshold(tmp_adev)) > > + dev_info(tmp_adev->dev, "GPU reset(%d) > > + failed\n", > > + > > + atomic_read(&tmp_adev->gpu_reset_counter)); > > amdgpu_vf_error_put(tmp_adev, > > AMDGIM_ERROR_VF_GPU_RESET_FAIL, 0, r); > > } else { > > dev_info(tmp_adev->dev, "GPU reset(%d) > > succeeded!\n", atomic_read(&tmp_adev->gpu_reset_counter)); > > -- > > 2.34.1 > > > >