On 7/31/2024 3:35 PM, Tao Zhou wrote:
> Instead of printing GPU reset failed.
>
> Signed-off-by: Tao Zhou <tao.zhou1@xxxxxxx>
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 9 +++++++--
>  1 file changed, 7 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> index 355c2478c4b6..b7c967779b4b 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> @@ -5933,8 +5933,13 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
>  			tmp_adev->asic_reset_res = 0;
>
>  		if (r) {
> -			/* bad news, how to tell it to userspace ? */
> -			dev_info(tmp_adev->dev, "GPU reset(%d) failed\n", atomic_read(&tmp_adev->gpu_reset_counter));
> +			/* bad news, how to tell it to userspace ?
> +			 * for ras error, we should report GPU bad status instead of
> +			 * reset failure
> +			 */
> +			if (!amdgpu_ras_eeprom_check_err_threshold(tmp_adev))
> +				dev_info(tmp_adev->dev, "GPU reset(%d) failed\n",
> +					atomic_read(&tmp_adev->gpu_reset_counter));

Better to check reset_context.src == AMDGPU_RESET_SRC_RAS to confirm that
the reset is indeed triggered due to ras error.

Thanks,
Lijo

>  			amdgpu_vf_error_put(tmp_adev, AMDGIM_ERROR_VF_GPU_RESET_FAIL, 0, r);
>  		} else {
>  			dev_info(tmp_adev->dev, "GPU reset(%d) succeeded!\n", atomic_read(&tmp_adev->gpu_reset_counter));
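
To make the suggestion concrete, here is a rough standalone sketch of the decision logic, not driver code. It assumes one possible reading of the comment: the "GPU reset failed" message is skipped only when the reset source is RAS and the bad page threshold check also reports true. Every `demo_`/`DEMO_` name is a hypothetical stand-in: `DEMO_RESET_SRC_RAS` mirrors AMDGPU_RESET_SRC_RAS, and the `ras_threshold_hit` argument stands for the result of amdgpu_ras_eeprom_check_err_threshold().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the driver's reset-source enum. Only the RAS
 * value is named in this thread; the other entry is a placeholder for
 * "any non-RAS source". */
enum demo_reset_src {
	DEMO_RESET_SRC_OTHER,
	DEMO_RESET_SRC_RAS,
};

/* Hypothetical stand-in for the reset context; only the field that the
 * review comment refers to is modelled. */
struct demo_reset_context {
	enum demo_reset_src src;
};

/*
 * Returns true when the generic "GPU reset(%d) failed" message should be
 * printed. The message is suppressed only if the reset was triggered by
 * RAS and the threshold check also reported true, so failures of resets
 * from other sources are still logged.
 */
static bool demo_should_log_reset_failure(const struct demo_reset_context *ctx,
					  bool ras_threshold_hit)
{
	assert(ctx != NULL);

	if (ctx->src == DEMO_RESET_SRC_RAS && ras_threshold_hit)
		return false;

	return true;
}
```

With this shape, a failed reset that did not originate from RAS keeps its log line even on a device whose threshold check returns true, which appears to be the case the extra source check is meant to cover.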