RE: [PATCH] drm/amdgpu: report bad status in GPU recovery

"Zhang, Hawking" <Hawking.Zhang@xxxxxxx> · Thu, 1 Aug 2024 07:50:04 +0000

[AMD Official Use Only - AMD Internal Distribution Only]

How about replace amdgpu_ras_eeprom_check_err_threshold with ras->is_rma and move amdgpu_ras_eeprom_check_err_threshold to end of gpu recovery as the change you made. In such case we don;'t need to add a paremter in the function just to disable a few messages. And also the message will be the end of gpu recovery as a reminder.

amdgpu_do_asic_reset
if (!amdgpu_ras_eeprom_check_err_threshold(tmp_adev)) {
/* must succeed. */
amdgpu_ras_resume(tmp_adev);
} else {
        r = -EINVAL;
        goto out;
}

Regards,
Hawking

-----Original Message-----
From: Zhou1, Tao <Tao.Zhou1@xxxxxxx>
Sent: Thursday, August 1, 2024 13:58
To: Zhang, Hawking <Hawking.Zhang@xxxxxxx>; amd-gfx@xxxxxxxxxxxxxxxxxxxxx
Subject: RE: [PATCH] drm/amdgpu: report bad status in GPU recovery

[AMD Official Use Only - AMD Internal Distribution Only]

Yes, the bad status message is printed twice with this patch. I think it's harmless and the second message is more convenient for customer.

I can add a parameter for amdgpu_ras_eeprom_check_err_threshold to disable the first message if you think printing message twice is not a good idea.

Tao

> -----Original Message-----
> From: Zhang, Hawking <Hawking.Zhang@xxxxxxx>
> Sent: Thursday, August 1, 2024 1:30 PM
> To: Zhou1, Tao <Tao.Zhou1@xxxxxxx>; amd-gfx@xxxxxxxxxxxxxxxxxxxxx
> Subject: RE: [PATCH] drm/amdgpu: report bad status in GPU recovery
>
> [AMD Official Use Only - AMD Internal Distribution Only]
>
> Right, it's functional. My concern is whether the kernel message in
> amdgpu_ras_eeprom_check_err_threshold will be printed twice. This is
> the end of gpu recovery (i.e., report gpu reset failed or gpu reset succeed).
> Check_err_threshold was already done before reaching here.
>
> Regards,
> Hawking
>
> -----Original Message-----
> From: Zhou1, Tao <Tao.Zhou1@xxxxxxx>
> Sent: Thursday, August 1, 2024 11:49
> To: Zhang, Hawking <Hawking.Zhang@xxxxxxx>;
> amd-gfx@xxxxxxxxxxxxxxxxxxxxx
> Subject: RE: [PATCH] drm/amdgpu: report bad status in GPU recovery
>
> [AMD Official Use Only - AMD Internal Distribution Only]
>
> I think the if condition in amdgpu_ras_eeprom_check_err_threshold is
> good enough, no need to update it with is_rma.
>
> Tao
>
> > -----Original Message-----
> > From: Zhang, Hawking <Hawking.Zhang@xxxxxxx>
> > Sent: Thursday, August 1, 2024 11:00 AM
> > To: Zhou1, Tao <Tao.Zhou1@xxxxxxx>; amd-gfx@xxxxxxxxxxxxxxxxxxxxx
> > Cc: Zhou1, Tao <Tao.Zhou1@xxxxxxx>
> > Subject: RE: [PATCH] drm/amdgpu: report bad status in GPU recovery
> >
> > [AMD Official Use Only - AMD Internal Distribution Only]
> >
> > Might consider leverage is_RMA flag for the same purpose?
> >
> > Regards,
> > Hawking
> >
> > -----Original Message-----
> > From: amd-gfx <amd-gfx-bounces@xxxxxxxxxxxxxxxxxxxxx> On Behalf Of
> > Tao Zhou
> > Sent: Wednesday, July 31, 2024 18:05
> > To: amd-gfx@xxxxxxxxxxxxxxxxxxxxx
> > Cc: Zhou1, Tao <Tao.Zhou1@xxxxxxx>
> > Subject: [PATCH] drm/amdgpu: report bad status in GPU recovery
> >
> > Instead of printing GPU reset failed.
> >
> > Signed-off-by: Tao Zhou <tao.zhou1@xxxxxxx>
> > ---
> >  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 9 +++++++--
> >  1 file changed, 7 insertions(+), 2 deletions(-)
> >
> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > index 355c2478c4b6..b7c967779b4b 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > @@ -5933,8 +5933,13 @@ int amdgpu_device_gpu_recover(struct
> > amdgpu_device *adev,
> >                 tmp_adev->asic_reset_res = 0;
> >
> >                 if (r) {
> > -                       /* bad news, how to tell it to userspace ? */
> > -                       dev_info(tmp_adev->dev, "GPU reset(%d) failed\n",
> > atomic_read(&tmp_adev->gpu_reset_counter));
> > +                       /* bad news, how to tell it to userspace ?
> > +                        * for ras error, we should report GPU bad status instead of
> > +                        * reset failure
> > +                        */
> > +                       if (!amdgpu_ras_eeprom_check_err_threshold(tmp_adev))
> > +                               dev_info(tmp_adev->dev, "GPU
> > + reset(%d) failed\n",
> > +
> > + atomic_read(&tmp_adev->gpu_reset_counter));
> >                         amdgpu_vf_error_put(tmp_adev,
> > AMDGIM_ERROR_VF_GPU_RESET_FAIL, 0, r);
> >                 } else {
> >                         dev_info(tmp_adev->dev, "GPU reset(%d)
> > succeeded!\n", atomic_read(&tmp_adev->gpu_reset_counter));
> > --
> > 2.34.1
> >
>
>