RE: [PATCH] drm/amdgpu: report bad status in GPU recovery

"Zhou1, Tao" <Tao.Zhou1@xxxxxxx> · Thu, 1 Aug 2024 03:47:40 +0000


> -----Original Message-----
> From: Lazar, Lijo <Lijo.Lazar@xxxxxxx>
> Sent: Wednesday, July 31, 2024 9:31 PM
> To: Zhou1, Tao <Tao.Zhou1@xxxxxxx>; amd-gfx@xxxxxxxxxxxxxxxxxxxxx
> Subject: Re: [PATCH] drm/amdgpu: report bad status in GPU recovery
>
>
>
> On 7/31/2024 3:35 PM, Tao Zhou wrote:
> > Instead of printing GPU reset failed.
> >
> > Signed-off-by: Tao Zhou <tao.zhou1@xxxxxxx>
> > ---
> >  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 9 +++++++--
> >  1 file changed, 7 insertions(+), 2 deletions(-)
> >
> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > index 355c2478c4b6..b7c967779b4b 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > @@ -5933,8 +5933,13 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
> >             tmp_adev->asic_reset_res = 0;
> >
> >             if (r) {
> > -                   /* bad news, how to tell it to userspace ? */
> > -                   dev_info(tmp_adev->dev, "GPU reset(%d) failed\n", atomic_read(&tmp_adev->gpu_reset_counter));
> > +                   /* bad news, how to tell it to userspace ?
> > +                    * for ras error, we should report GPU bad status instead of
> > +                    * reset failure
> > +                    */
> > +                   if (!amdgpu_ras_eeprom_check_err_threshold(tmp_adev))
> > +                           dev_info(tmp_adev->dev, "GPU reset(%d) failed\n",
> > +                                   atomic_read(&tmp_adev->gpu_reset_counter));
>
> Better to check reset_context.src == AMDGPU_RESET_SRC_RAS to confirm that
> the reset is indeed triggered due to ras error.

[Tao] It seems AMDGPU_RESET_SRC_RAS is not used currently; I will set it before using the flag.
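
For illustration only, here is a rough sketch of how the two conditions could combine once the reset source is set for RAS-triggered resets. It is standalone C with stand-in types, not driver code: every name ending in `_sketch` and the helper `should_log_reset_failure()` are hypothetical, and the `bad_page_threshold_hit` argument merely models a true result from amdgpu_ras_eeprom_check_err_threshold().

```c
#include <stdbool.h>

/* Stand-in for the driver's reset-source enum; only the values needed here. */
enum reset_src_sketch {
	RESET_SRC_SKETCH_UNKNOWN,
	RESET_SRC_SKETCH_JOB,
	RESET_SRC_SKETCH_RAS,
};

/* Stand-in for the reset context; the real structure carries more fields. */
struct reset_context_sketch {
	enum reset_src_sketch src;
};

/*
 * Decide whether the generic "GPU reset failed" message should be printed.
 * Assumption: bad_page_threshold_hit models a true result from
 * amdgpu_ras_eeprom_check_err_threshold().
 */
static bool should_log_reset_failure(const struct reset_context_sketch *ctx,
				     bool bad_page_threshold_hit)
{
	/*
	 * Suppress the generic message only when the reset was triggered by
	 * a RAS error and the device has already been flagged as bad.
	 */
	if (ctx->src == RESET_SRC_SKETCH_RAS && bad_page_threshold_hit)
		return false;

	return true;
}
```

With this shape, a failed reset from any other source still prints the failure message even if the bad page threshold happens to be exceeded.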

>
> Thanks,
> Lijo
>
> >                     amdgpu_vf_error_put(tmp_adev, AMDGIM_ERROR_VF_GPU_RESET_FAIL, 0, r);
> >             } else {
> >                     dev_info(tmp_adev->dev, "GPU reset(%d) succeeded!\n", atomic_read(&tmp_adev->gpu_reset_counter));