RE: [PATCH] drm/amdgpu: report bad status in GPU recovery

"Zhou1, Tao" <Tao.Zhou1@xxxxxxx> · Thu, 1 Aug 2024 07:00:14 +0000

[AMD Official Use Only - AMD Internal Distribution Only]

We still need to perform the GPU reset on the hardware; only the reset flow should be made to fail from the driver's perspective.
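
A minimal sketch of that idea, not the actual driver code: the hardware reset always runs, and only the result reported by the driver is forced to failure when the device is in bad status. The struct, the mock_* functions, the message strings other than "GPU reset failed", and the >= threshold comparison are all hypothetical stand-ins; only the name amdgpu_ras_eeprom_check_err_threshold comes from the patch under discussion.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical stand-in for device state; NOT the real struct amdgpu_device. */
struct mock_dev {
	int bad_page_count;     /* bad pages recorded so far */
	int bad_page_threshold; /* assumed limit for "bad status" */
	bool hw_reset_fails;    /* simulate a genuine hardware reset failure */
	int hw_resets;          /* number of hardware resets performed */
	char last_msg[64];      /* last status message, kept for inspection */
};

/* Plays the role of amdgpu_ras_eeprom_check_err_threshold() in this model:
 * returns true when the device should be treated as being in bad status.
 * The >= comparison is an assumption of this sketch. */
static bool mock_check_err_threshold(const struct mock_dev *dev)
{
	return dev->bad_page_count >= dev->bad_page_threshold;
}

/* The hardware reset always runs; only the result seen by the driver changes. */
static int mock_gpu_recover(struct mock_dev *dev)
{
	bool bad;
	int r;

	assert(dev != NULL);

	dev->hw_resets++; /* reset is performed regardless of RAS state */
	r = dev->hw_reset_fails ? -1 : 0;

	bad = mock_check_err_threshold(dev);
	if (bad)
		r = -1; /* fail the reset flow from the driver's perspective */

	if (r) {
		/* for a RAS error, report bad status instead of reset failure */
		if (bad)
			snprintf(dev->last_msg, sizeof(dev->last_msg), "GPU in bad status");
		else
			snprintf(dev->last_msg, sizeof(dev->last_msg), "GPU reset failed");
	} else {
		snprintf(dev->last_msg, sizeof(dev->last_msg), "GPU reset succeeded");
	}

	return r;
}
```

In this model, a device at or over the threshold still has hw_resets incremented, but mock_gpu_recover() returns an error and records the bad status message rather than the reset failure one.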

Tao

> -----Original Message-----
> From: Lazar, Lijo <Lijo.Lazar@xxxxxxx>
> Sent: Thursday, August 1, 2024 2:41 PM
> To: Zhou1, Tao <Tao.Zhou1@xxxxxxx>; Zhang, Hawking
> <Hawking.Zhang@xxxxxxx>; amd-gfx@xxxxxxxxxxxxxxxxxxxxx
> Subject: Re: [PATCH] drm/amdgpu: report bad status in GPU recovery
>
>
>
> On 8/1/2024 11:28 AM, Zhou1, Tao wrote:
> > [AMD Official Use Only - AMD Internal Distribution Only]
> >
> > Yes, the bad status message is printed twice with this patch. I think it's harmless,
> > and the second message is more convenient for the customer.
> >
> > I can add a parameter to amdgpu_ras_eeprom_check_err_threshold to disable
> > the first message if you think printing the message twice is not a good idea.
> >
>
> Instead of doing it this way, can't this check be added to amdgpu_ras_do_recovery()
> to stop all recovery actions?
>
> Thanks,
> Lijo
>
> > Tao
> >
> >> -----Original Message-----
> >> From: Zhang, Hawking <Hawking.Zhang@xxxxxxx>
> >> Sent: Thursday, August 1, 2024 1:30 PM
> >> To: Zhou1, Tao <Tao.Zhou1@xxxxxxx>; amd-gfx@xxxxxxxxxxxxxxxxxxxxx
> >> Subject: RE: [PATCH] drm/amdgpu: report bad status in GPU recovery
> >>
> >> [AMD Official Use Only - AMD Internal Distribution Only]
> >>
> >> Right, it's functional. My concern is whether the kernel message in
> >> amdgpu_ras_eeprom_check_err_threshold will be printed twice. This is
> >> the end of GPU recovery (i.e., reporting whether the GPU reset failed or succeeded).
> >> Check_err_threshold was already done before reaching here.
> >>
> >> Regards,
> >> Hawking
> >>
> >> -----Original Message-----
> >> From: Zhou1, Tao <Tao.Zhou1@xxxxxxx>
> >> Sent: Thursday, August 1, 2024 11:49
> >> To: Zhang, Hawking <Hawking.Zhang@xxxxxxx>;
> >> amd-gfx@xxxxxxxxxxxxxxxxxxxxx
> >> Subject: RE: [PATCH] drm/amdgpu: report bad status in GPU recovery
> >>
> >> [AMD Official Use Only - AMD Internal Distribution Only]
> >>
> >> I think the if condition in amdgpu_ras_eeprom_check_err_threshold is
> >> good enough, no need to update it with is_rma.
> >>
> >> Tao
> >>
> >>> -----Original Message-----
> >>> From: Zhang, Hawking <Hawking.Zhang@xxxxxxx>
> >>> Sent: Thursday, August 1, 2024 11:00 AM
> >>> To: Zhou1, Tao <Tao.Zhou1@xxxxxxx>; amd-gfx@xxxxxxxxxxxxxxxxxxxxx
> >>> Cc: Zhou1, Tao <Tao.Zhou1@xxxxxxx>
> >>> Subject: RE: [PATCH] drm/amdgpu: report bad status in GPU recovery
> >>>
> >>> [AMD Official Use Only - AMD Internal Distribution Only]
> >>>
> >>> Might we consider leveraging the is_RMA flag for the same purpose?
> >>>
> >>> Regards,
> >>> Hawking
> >>>
> >>> -----Original Message-----
> >>> From: amd-gfx <amd-gfx-bounces@xxxxxxxxxxxxxxxxxxxxx> On Behalf Of
> >>> Tao Zhou
> >>> Sent: Wednesday, July 31, 2024 18:05
> >>> To: amd-gfx@xxxxxxxxxxxxxxxxxxxxx
> >>> Cc: Zhou1, Tao <Tao.Zhou1@xxxxxxx>
> >>> Subject: [PATCH] drm/amdgpu: report bad status in GPU recovery
> >>>
> >>> Instead of printing GPU reset failed.
> >>>
> >>> Signed-off-by: Tao Zhou <tao.zhou1@xxxxxxx>
> >>> ---
> >>>  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 9 +++++++--
> >>>  1 file changed, 7 insertions(+), 2 deletions(-)
> >>>
> >>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> >>> index 355c2478c4b6..b7c967779b4b 100644
> >>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> >>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> >>> @@ -5933,8 +5933,13 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
> >>>                 tmp_adev->asic_reset_res = 0;
> >>>
> >>>                 if (r) {
> >>> -                       /* bad news, how to tell it to userspace ? */
> >>> -                       dev_info(tmp_adev->dev, "GPU reset(%d) failed\n", atomic_read(&tmp_adev->gpu_reset_counter));
> >>> +                       /* bad news, how to tell it to userspace ?
> >>> +                        * for ras error, we should report GPU bad status instead of
> >>> +                        * reset failure
> >>> +                        */
> >>> +                       if (!amdgpu_ras_eeprom_check_err_threshold(tmp_adev))
> >>> +                               dev_info(tmp_adev->dev, "GPU reset(%d) failed\n",
> >>> +                                        atomic_read(&tmp_adev->gpu_reset_counter));
> >>>                         amdgpu_vf_error_put(tmp_adev, AMDGIM_ERROR_VF_GPU_RESET_FAIL, 0, r);
> >>>                 } else {
> >>>                         dev_info(tmp_adev->dev, "GPU reset(%d) succeeded!\n", atomic_read(&tmp_adev->gpu_reset_counter));
> >>> --
> >>> 2.34.1
> >>>
> >>
> >>
> >