On 8/1/2024 11:28 AM, Zhou1, Tao wrote: > [AMD Official Use Only - AMD Internal Distribution Only] > > [AMD Official Use Only - AMD Internal Distribution Only] > > Yes, the bad status message is printed twice with this patch. I think it's harmless and the second message is more convenient for customer. > > I can add a parameter for amdgpu_ras_eeprom_check_err_threshold to disable the first message if you think printing message twice is not a good idea. > Instead of this way, can't this be added to amdgpu_ras_do_recovery() and stop all recovery actions? Thanks, Lijo > Tao > >> -----Original Message----- >> From: Zhang, Hawking <Hawking.Zhang@xxxxxxx> >> Sent: Thursday, August 1, 2024 1:30 PM >> To: Zhou1, Tao <Tao.Zhou1@xxxxxxx>; amd-gfx@xxxxxxxxxxxxxxxxxxxxx >> Subject: RE: [PATCH] drm/amdgpu: report bad status in GPU recovery >> >> [AMD Official Use Only - AMD Internal Distribution Only] >> >> Right, it's functional. My concern is whether the kernel message in >> amdgpu_ras_eeprom_check_err_threshold will be printed twice. This is the end >> of gpu recovery (i.e., report gpu reset failed or gpu reset succeed). >> Check_err_threshold was already done before reaching here. >> >> Regards, >> Hawking >> >> -----Original Message----- >> From: Zhou1, Tao <Tao.Zhou1@xxxxxxx> >> Sent: Thursday, August 1, 2024 11:49 >> To: Zhang, Hawking <Hawking.Zhang@xxxxxxx>; amd-gfx@xxxxxxxxxxxxxxxxxxxxx >> Subject: RE: [PATCH] drm/amdgpu: report bad status in GPU recovery >> >> [AMD Official Use Only - AMD Internal Distribution Only] >> >> I think the if condition in amdgpu_ras_eeprom_check_err_threshold is good >> enough, no need to update it with is_rma. >> >> Tao >> >>> -----Original Message----- >>> From: Zhang, Hawking <Hawking.Zhang@xxxxxxx> >>> Sent: Thursday, August 1, 2024 11:00 AM >>> To: Zhou1, Tao <Tao.Zhou1@xxxxxxx>; amd-gfx@xxxxxxxxxxxxxxxxxxxxx >>> Cc: Zhou1, Tao <Tao.Zhou1@xxxxxxx> >>> Subject: RE: [PATCH] drm/amdgpu: report bad status in GPU recovery >>> >>> [AMD Official Use Only - AMD Internal Distribution Only] >>> >>> Might consider leverage is_RMA flag for the same purpose? >>> >>> Regards, >>> Hawking >>> >>> -----Original Message----- >>> From: amd-gfx <amd-gfx-bounces@xxxxxxxxxxxxxxxxxxxxx> On Behalf Of Tao >>> Zhou >>> Sent: Wednesday, July 31, 2024 18:05 >>> To: amd-gfx@xxxxxxxxxxxxxxxxxxxxx >>> Cc: Zhou1, Tao <Tao.Zhou1@xxxxxxx> >>> Subject: [PATCH] drm/amdgpu: report bad status in GPU recovery >>> >>> Instead of printing GPU reset failed. >>> >>> Signed-off-by: Tao Zhou <tao.zhou1@xxxxxxx> >>> --- >>> drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 9 +++++++-- >>> 1 file changed, 7 insertions(+), 2 deletions(-) >>> >>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c >>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c >>> index 355c2478c4b6..b7c967779b4b 100644 >>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c >>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c >>> @@ -5933,8 +5933,13 @@ int amdgpu_device_gpu_recover(struct >>> amdgpu_device *adev, >>> tmp_adev->asic_reset_res = 0; >>> >>> if (r) { >>> - /* bad news, how to tell it to userspace ? */ >>> - dev_info(tmp_adev->dev, "GPU reset(%d) failed\n", >>> atomic_read(&tmp_adev->gpu_reset_counter)); >>> + /* bad news, how to tell it to userspace ? >>> + * for ras error, we should report GPU bad status instead of >>> + * reset failure >>> + */ >>> + if (!amdgpu_ras_eeprom_check_err_threshold(tmp_adev)) >>> + dev_info(tmp_adev->dev, "GPU reset(%d) >>> + failed\n", >>> + >>> + atomic_read(&tmp_adev->gpu_reset_counter)); >>> amdgpu_vf_error_put(tmp_adev, >>> AMDGIM_ERROR_VF_GPU_RESET_FAIL, 0, r); >>> } else { >>> dev_info(tmp_adev->dev, "GPU reset(%d) >>> succeeded!\n", atomic_read(&tmp_adev->gpu_reset_counter)); >>> -- >>> 2.34.1 >>> >> >> >