RE: [PATCH] drm/amdgpu: don't increase UMC RAS UE count if no new bad page

"Zhang, Hawking" <Hawking.Zhang@xxxxxxx> · Tue, 14 Feb 2023 10:03:26 +0000

[AMD Official Use Only - General]

-EINVAL looks better than EEXIST, but it still doesn't fit the case? Alternatively, Can we compare the count before and after amdgpu_ras_add_bad_pages to decide whether reset ue_count to 0 or not. That could be straightforward than struggling for returning which error code in this specific case.

Regards,
Hawking

-----Original Message-----
From: Yang, Stanley <Stanley.Yang@xxxxxxx>
Sent: Tuesday, February 14, 2023 10:42
To: Zhou1, Tao <Tao.Zhou1@xxxxxxx>; Zhang, Hawking <Hawking.Zhang@xxxxxxx>; amd-gfx@xxxxxxxxxxxxxxxxxxxxx; Chai, Thomas <YiPeng.Chai@xxxxxxx>; Li, Candice <Candice.Li@xxxxxxx>
Subject: RE: [PATCH] drm/amdgpu: don't increase UMC RAS UE count if no new bad page

[AMD Official Use Only - General]

> -----Original Message-----
> From: Zhou1, Tao <Tao.Zhou1@xxxxxxx>
> Sent: Monday, February 13, 2023 4:25 PM
> To: Zhang, Hawking <Hawking.Zhang@xxxxxxx>; amd-
> gfx@xxxxxxxxxxxxxxxxxxxxx; Yang, Stanley <Stanley.Yang@xxxxxxx>; Chai,
> Thomas <YiPeng.Chai@xxxxxxx>; Li, Candice <Candice.Li@xxxxxxx>
> Subject: RE: [PATCH] drm/amdgpu: don't increase UMC RAS UE count if no
> new bad page
>
> [AMD Official Use Only - General]
>
> > -----Original Message-----
> > From: Zhang, Hawking <Hawking.Zhang@xxxxxxx>
> > Sent: Friday, February 10, 2023 11:02 PM
> > To: Zhou1, Tao <Tao.Zhou1@xxxxxxx>; amd-gfx@xxxxxxxxxxxxxxxxxxxxx;
> > Yang, Stanley <Stanley.Yang@xxxxxxx>; Chai, Thomas
> > <YiPeng.Chai@xxxxxxx>; Li, Candice <Candice.Li@xxxxxxx>
> > Subject: RE: [PATCH] drm/amdgpu: don't increase UMC RAS UE count if
> > no new bad page
> >
> > [AMD Official Use Only - General]
> >
> > +                       /* if no new bad page is found, no need to
> > + increase ue count
> */
> > +                       if (ret == -EEXIST)
> > +                               err_data->ue_count = 0;
> >
> > Returning EEXIST in such case is not reasonable. Might consider
> > return a bool for
> > amdgpu_ras_add_bad_pages: true means it does add some new bad page;
> > false means it doesn't change anything.
> >
> > Regards,
> > Hawking
>
> [Tao] but it can returns -ENOMEM, amdgpu_ras_load_bad_pages and
> amdgpu_ras_recovery_init also need to check the return value. I'd like
> to keep the type of return value unchanged.
> How about -EINVAL?

Stanley: How about return -EALREADY?

Regards,
Stanley
>
> >
> > -----Original Message-----
> > From: Zhou1, Tao <Tao.Zhou1@xxxxxxx>
> > Sent: Friday, February 10, 2023 16:45
> > To: amd-gfx@xxxxxxxxxxxxxxxxxxxxx; Zhang, Hawking
> > <Hawking.Zhang@xxxxxxx>; Yang, Stanley <Stanley.Yang@xxxxxxx>;
> Chai,
> > Thomas <YiPeng.Chai@xxxxxxx>; Li, Candice <Candice.Li@xxxxxxx>
> > Cc: Zhou1, Tao <Tao.Zhou1@xxxxxxx>
> > Subject: [PATCH] drm/amdgpu: don't increase UMC RAS UE count if no
> new
> > bad page
> >
> > If a UMC bad page is reserved but not freed by an application, the
> > application may trigger uncorrectable error repeatly by accessing the page.
> >
> > Signed-off-by: Tao Zhou <tao.zhou1@xxxxxxx>
> > ---
> >  drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c | 9 ++++++++-
> > drivers/gpu/drm/amd/amdgpu/amdgpu_umc.c | 6 +++++-
> >  2 files changed, 13 insertions(+), 2 deletions(-)
> >
> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> > b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> > index e85c4689ce2c..eafe01a24349 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> > @@ -2049,7 +2049,7 @@ int amdgpu_ras_add_bad_pages(struct
> > amdgpu_device *adev,  {
> >         struct amdgpu_ras *con = amdgpu_ras_get_context(adev);
> >         struct ras_err_handler_data *data;
> > -       int ret = 0;
> > +       int ret = 0, old_cnt;
> >         uint32_t i;
> >
> >         if (!con || !con->eh_data || !bps || pages <= 0) @@ -2060,6
> > +2060,8 @@ int amdgpu_ras_add_bad_pages(struct amdgpu_device
> *adev,
> >         if (!data)
> >                 goto out;
> >
> > +       old_cnt = data->count;
> > +
> >         for (i = 0; i < pages; i++) {
> >                 if (amdgpu_ras_check_bad_page_unlock(con,
> >                         bps[i].retired_page <<
> > AMDGPU_GPU_PAGE_SHIFT)) @@ -2079,6
> > +2081,11 @@ int amdgpu_ras_add_bad_pages(struct amdgpu_device
> *adev,
> >                 data->count++;
> >                 data->space_left--;
> >         }
> > +
> > +       /* all pages have been reserved before, no new bad page */
> > +       if (old_cnt == data->count)
> > +               ret = -EEXIST;
> > +
> >  out:
> >         mutex_unlock(&con->recovery_lock);
> >
> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_umc.c
> > b/drivers/gpu/drm/amd/amdgpu/amdgpu_umc.c
> > index 1c7fcb4f2380..772c431e4065 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_umc.c
> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_umc.c
> > @@ -145,8 +145,12 @@ static int
> amdgpu_umc_do_page_retirement(struct
> > amdgpu_device *adev,
> >
> >                 if ((amdgpu_bad_page_threshold != 0) &&
> >                         err_data->err_addr_cnt) {
> > -                       amdgpu_ras_add_bad_pages(adev, err_data->err_addr,
> > +                       ret = amdgpu_ras_add_bad_pages(adev,
> > + err_data->err_addr,
> >
> > err_data->err_addr_cnt);
> > +                       /* if no new bad page is found, no need to
> > + increase ue count
> */
> > +                       if (ret == -EEXIST)
> > +                               err_data->ue_count = 0;
> > +
> >                         amdgpu_ras_save_bad_pages(adev);
> >
> >                         amdgpu_dpm_send_hbm_bad_pages_num(adev, con-
> > >eeprom_control.ras_num_recs);
> > --
> > 2.35.1
> >
>