RE: [PATCH] drm/amdgpu: don't increase UMC RAS UE count if no new bad page

"Zhou1, Tao" <Tao.Zhou1@xxxxxxx> · Tue, 14 Feb 2023 10:27:27 +0000

[AMD Official Use Only - General]

OK, I'll add a new function to do the check.

Tao

> -----Original Message-----
> From: Zhang, Hawking <Hawking.Zhang@xxxxxxx>
> Sent: Tuesday, February 14, 2023 6:03 PM
> To: Yang, Stanley <Stanley.Yang@xxxxxxx>; Zhou1, Tao
> <Tao.Zhou1@xxxxxxx>; amd-gfx@xxxxxxxxxxxxxxxxxxxxx; Chai, Thomas
> <YiPeng.Chai@xxxxxxx>; Li, Candice <Candice.Li@xxxxxxx>
> Subject: RE: [PATCH] drm/amdgpu: don't increase UMC RAS UE count if no new
> bad page
>
> [AMD Official Use Only - General]
>
> -EINVAL looks better than EEXIST, but it still doesn't fit the case? Alternatively,
> Can we compare the count before and after amdgpu_ras_add_bad_pages to
> decide whether reset ue_count to 0 or not. That could be straightforward than
> struggling for returning which error code in this specific case.
>
> Regards,
> Hawking
>
> -----Original Message-----
> From: Yang, Stanley <Stanley.Yang@xxxxxxx>
> Sent: Tuesday, February 14, 2023 10:42
> To: Zhou1, Tao <Tao.Zhou1@xxxxxxx>; Zhang, Hawking
> <Hawking.Zhang@xxxxxxx>; amd-gfx@xxxxxxxxxxxxxxxxxxxxx; Chai, Thomas
> <YiPeng.Chai@xxxxxxx>; Li, Candice <Candice.Li@xxxxxxx>
> Subject: RE: [PATCH] drm/amdgpu: don't increase UMC RAS UE count if no new
> bad page
>
> [AMD Official Use Only - General]
>
>
>
> > -----Original Message-----
> > From: Zhou1, Tao <Tao.Zhou1@xxxxxxx>
> > Sent: Monday, February 13, 2023 4:25 PM
> > To: Zhang, Hawking <Hawking.Zhang@xxxxxxx>; amd-
> > gfx@xxxxxxxxxxxxxxxxxxxxx; Yang, Stanley <Stanley.Yang@xxxxxxx>; Chai,
> > Thomas <YiPeng.Chai@xxxxxxx>; Li, Candice <Candice.Li@xxxxxxx>
> > Subject: RE: [PATCH] drm/amdgpu: don't increase UMC RAS UE count if no
> > new bad page
> >
> > [AMD Official Use Only - General]
> >
> > > -----Original Message-----
> > > From: Zhang, Hawking <Hawking.Zhang@xxxxxxx>
> > > Sent: Friday, February 10, 2023 11:02 PM
> > > To: Zhou1, Tao <Tao.Zhou1@xxxxxxx>; amd-gfx@xxxxxxxxxxxxxxxxxxxxx;
> > > Yang, Stanley <Stanley.Yang@xxxxxxx>; Chai, Thomas
> > > <YiPeng.Chai@xxxxxxx>; Li, Candice <Candice.Li@xxxxxxx>
> > > Subject: RE: [PATCH] drm/amdgpu: don't increase UMC RAS UE count if
> > > no new bad page
> > >
> > > [AMD Official Use Only - General]
> > >
> > > +                       /* if no new bad page is found, no need to
> > > + increase ue count
> > */
> > > +                       if (ret == -EEXIST)
> > > +                               err_data->ue_count = 0;
> > >
> > > Returning EEXIST in such case is not reasonable. Might consider
> > > return a bool for
> > > amdgpu_ras_add_bad_pages: true means it does add some new bad page;
> > > false means it doesn't change anything.
> > >
> > > Regards,
> > > Hawking
> >
> > [Tao] but it can returns -ENOMEM, amdgpu_ras_load_bad_pages and
> > amdgpu_ras_recovery_init also need to check the return value. I'd like
> > to keep the type of return value unchanged.
> > How about -EINVAL?
>
> Stanley: How about return -EALREADY?
>
> Regards,
> Stanley
> >
> > >
> > > -----Original Message-----
> > > From: Zhou1, Tao <Tao.Zhou1@xxxxxxx>
> > > Sent: Friday, February 10, 2023 16:45
> > > To: amd-gfx@xxxxxxxxxxxxxxxxxxxxx; Zhang, Hawking
> > > <Hawking.Zhang@xxxxxxx>; Yang, Stanley <Stanley.Yang@xxxxxxx>;
> > Chai,
> > > Thomas <YiPeng.Chai@xxxxxxx>; Li, Candice <Candice.Li@xxxxxxx>
> > > Cc: Zhou1, Tao <Tao.Zhou1@xxxxxxx>
> > > Subject: [PATCH] drm/amdgpu: don't increase UMC RAS UE count if no
> > new
> > > bad page
> > >
> > > If a UMC bad page is reserved but not freed by an application, the
> > > application may trigger uncorrectable error repeatly by accessing the page.
> > >
> > > Signed-off-by: Tao Zhou <tao.zhou1@xxxxxxx>
> > > ---
> > >  drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c | 9 ++++++++-
> > > drivers/gpu/drm/amd/amdgpu/amdgpu_umc.c | 6 +++++-
> > >  2 files changed, 13 insertions(+), 2 deletions(-)
> > >
> > > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> > > b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> > > index e85c4689ce2c..eafe01a24349 100644
> > > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> > > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> > > @@ -2049,7 +2049,7 @@ int amdgpu_ras_add_bad_pages(struct
> > > amdgpu_device *adev,  {
> > >         struct amdgpu_ras *con = amdgpu_ras_get_context(adev);
> > >         struct ras_err_handler_data *data;
> > > -       int ret = 0;
> > > +       int ret = 0, old_cnt;
> > >         uint32_t i;
> > >
> > >         if (!con || !con->eh_data || !bps || pages <= 0) @@ -2060,6
> > > +2060,8 @@ int amdgpu_ras_add_bad_pages(struct amdgpu_device
> > *adev,
> > >         if (!data)
> > >                 goto out;
> > >
> > > +       old_cnt = data->count;
> > > +
> > >         for (i = 0; i < pages; i++) {
> > >                 if (amdgpu_ras_check_bad_page_unlock(con,
> > >                         bps[i].retired_page <<
> > > AMDGPU_GPU_PAGE_SHIFT)) @@ -2079,6
> > > +2081,11 @@ int amdgpu_ras_add_bad_pages(struct amdgpu_device
> > *adev,
> > >                 data->count++;
> > >                 data->space_left--;
> > >         }
> > > +
> > > +       /* all pages have been reserved before, no new bad page */
> > > +       if (old_cnt == data->count)
> > > +               ret = -EEXIST;
> > > +
> > >  out:
> > >         mutex_unlock(&con->recovery_lock);
> > >
> > > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_umc.c
> > > b/drivers/gpu/drm/amd/amdgpu/amdgpu_umc.c
> > > index 1c7fcb4f2380..772c431e4065 100644
> > > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_umc.c
> > > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_umc.c
> > > @@ -145,8 +145,12 @@ static int
> > amdgpu_umc_do_page_retirement(struct
> > > amdgpu_device *adev,
> > >
> > >                 if ((amdgpu_bad_page_threshold != 0) &&
> > >                         err_data->err_addr_cnt) {
> > > -                       amdgpu_ras_add_bad_pages(adev, err_data->err_addr,
> > > +                       ret = amdgpu_ras_add_bad_pages(adev,
> > > + err_data->err_addr,
> > >
> > > err_data->err_addr_cnt);
> > > +                       /* if no new bad page is found, no need to
> > > + increase ue count
> > */
> > > +                       if (ret == -EEXIST)
> > > +                               err_data->ue_count = 0;
> > > +
> > >                         amdgpu_ras_save_bad_pages(adev);
> > >
> > >                         amdgpu_dpm_send_hbm_bad_pages_num(adev, con-
> > > >eeprom_control.ras_num_recs);
> > > --
> > > 2.35.1
> > >
> >