On 8/1/2024 9:17 AM, Zhou1, Tao wrote:
> [AMD Official Use Only - AMD Internal Distribution Only]
>
>> -----Original Message-----
>> From: Lazar, Lijo <Lijo.Lazar@xxxxxxx>
>> Sent: Wednesday, July 31, 2024 9:31 PM
>> To: Zhou1, Tao <Tao.Zhou1@xxxxxxx>; amd-gfx@xxxxxxxxxxxxxxxxxxxxx
>> Subject: Re: [PATCH] drm/amdgpu: report bad status in GPU recovery
>>
>> On 7/31/2024 3:35 PM, Tao Zhou wrote:
>>> Instead of printing GPU reset failed.
>>>
>>> Signed-off-by: Tao Zhou <tao.zhou1@xxxxxxx>
>>> ---
>>>  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 9 +++++++--
>>>  1 file changed, 7 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>> index 355c2478c4b6..b7c967779b4b 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>> @@ -5933,8 +5933,13 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
>>>  		tmp_adev->asic_reset_res = 0;
>>>
>>>  		if (r) {
>>> -			/* bad news, how to tell it to userspace ? */
>>> -			dev_info(tmp_adev->dev, "GPU reset(%d) failed\n", atomic_read(&tmp_adev->gpu_reset_counter));
>>> +			/* bad news, how to tell it to userspace ?
>>> +			 * for ras error, we should report GPU bad status instead of
>>> +			 * reset failure
>>> +			 */
>>> +			if (!amdgpu_ras_eeprom_check_err_threshold(tmp_adev))
>>> +				dev_info(tmp_adev->dev, "GPU reset(%d) failed\n",
>>> +					atomic_read(&tmp_adev->gpu_reset_counter));
>>
>> Better to check reset_context.src == AMDGPU_RESET_SRC_RAS to confirm that
>> the reset is indeed triggered due to ras error.
>
> [Tao] It seems AMDGPU_RESET_SRC_RAS is not used currently, I will set it before use the flag.
>

It's set here -
https://elixir.bootlin.com/linux/v6.11-rc1/source/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c#L2607

Thanks,
Lijo

>>
>> Thanks,
>> Lijo
>>
>>>  			amdgpu_vf_error_put(tmp_adev, AMDGIM_ERROR_VF_GPU_RESET_FAIL, 0, r);
>>>  		} else {
>>>  			dev_info(tmp_adev->dev, "GPU reset(%d) succeeded!\n",
>>>  				atomic_read(&tmp_adev->gpu_reset_counter));
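
For reference, the combined condition suggested in this thread could look roughly like the userspace sketch below. It is only an illustration of one reading of the suggestion, not driver code: the enum, the struct and both function names are hypothetical stand-ins, and the over_threshold field merely models amdgpu_ras_eeprom_check_err_threshold() returning true. The generic "GPU reset failed" line is skipped only when the reset was RAS-triggered and the bad-page threshold has been reached.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the driver's reset-source values. */
enum mock_reset_src {
	MOCK_RESET_SRC_UNKNOWN,
	MOCK_RESET_SRC_JOB,
	MOCK_RESET_SRC_RAS,
};

/* Hypothetical stand-in for the state the check looks at. */
struct mock_reset_state {
	enum mock_reset_src src;	/* what triggered the reset */
	bool over_threshold;		/* models the EEPROM threshold check returning true */
	int reset_result;		/* 0 on success, non-zero on failure ("r" in the patch) */
};

/* Returns true when the generic "GPU reset failed" line should be printed. */
bool mock_should_log_reset_failure(const struct mock_reset_state *s)
{
	if (s->reset_result == 0)
		return false;	/* reset succeeded, nothing to report */

	/* RAS-triggered reset past the threshold: bad status is reported instead. */
	if (s->src == MOCK_RESET_SRC_RAS && s->over_threshold)
		return false;

	return true;
}

/* Small self-check covering the three interesting cases. */
void mock_self_check(void)
{
	struct mock_reset_state s = { MOCK_RESET_SRC_RAS, true, -1 };

	assert(!mock_should_log_reset_failure(&s));	/* RAS + threshold: suppressed */

	s.src = MOCK_RESET_SRC_JOB;
	assert(mock_should_log_reset_failure(&s));	/* other source: still logged */

	s.reset_result = 0;
	assert(!mock_should_log_reset_failure(&s));	/* success: nothing logged */
}
```

Checking the source first keeps a non-RAS reset failure visible even if the threshold happens to be exceeded for an unrelated reason.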