RE: [PATCH] drm/amdgpu: Add recovery_lock to save bad pages function

"Zhou1, Tao" <Tao.Zhou1@xxxxxxx> · Wed, 17 Nov 2021 03:28:27 +0000



Reviewed-by: Tao Zhou <tao.zhou1@xxxxxxx>

> -----Original Message-----
> From: Li, Candice <Candice.Li@xxxxxxx>
> Sent: Wednesday, November 17, 2021 11:08 AM
> To: Zhou1, Tao <Tao.Zhou1@xxxxxxx>; amd-gfx@xxxxxxxxxxxxxxxxxxxxx
> Cc: Clements, John <John.Clements@xxxxxxx>
> Subject: RE: [PATCH] drm/amdgpu: Add recovery_lock to save bad pages
> function
> 
> [Public]
> 
> Thanks for the review, Tao. Updated the position for unlocking.
> 
> Fix race condition failure during UMC UE injection.
> 
> Signed-off-by: Candice Li <candice.li@xxxxxxx>
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c | 2 ++
>  1 file changed, 2 insertions(+)
> 
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> index 08133de21fdd63..53b957a5b9a65c 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> @@ -1935,9 +1935,11 @@ int amdgpu_ras_save_bad_pages(struct amdgpu_device *adev)
>         if (!con || !con->eh_data)
>                 return 0;
> 
> +       mutex_lock(&con->recovery_lock);
>         control = &con->eeprom_control;
>         data = con->eh_data;
>         save_count = data->count - control->ras_num_recs;
> +       mutex_unlock(&con->recovery_lock);
>         /* only new entries are saved */
>         if (save_count > 0) {
>                 if (amdgpu_ras_eeprom_append(control,
> --
> 2.17.1
> 
> 
> 
> Thanks,
> Candice
> 
> -----Original Message-----
> From: Zhou1, Tao <Tao.Zhou1@xxxxxxx>
> Sent: Tuesday, November 16, 2021 4:27 PM
> To: Li, Candice <Candice.Li@xxxxxxx>; amd-gfx@xxxxxxxxxxxxxxxxxxxxx
> Cc: Clements, John <John.Clements@xxxxxxx>
> Subject: RE: [PATCH] drm/amdgpu: Add recovery_lock to save bad pages
> function
> 
> [AMD Official Use Only]
> 
> > -----Original Message-----
> > From: Li, Candice <Candice.Li@xxxxxxx>
> > Sent: Tuesday, November 16, 2021 4:02 PM
> > To: amd-gfx@xxxxxxxxxxxxxxxxxxxxx
> > Cc: Clements, John <John.Clements@xxxxxxx>; Zhou1, Tao
> > <Tao.Zhou1@xxxxxxx>; Li, Candice <Candice.Li@xxxxxxx>
> > Subject: [PATCH] drm/amdgpu: Add recovery_lock to save bad pages
> > function
> >
> > Fix race condition failure during UMC UE injection.
> >
> > Signed-off-by: Candice Li <candice.li@xxxxxxx>
> > ---
> >  drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c | 9 +++++++--
> >  1 file changed, 7 insertions(+), 2 deletions(-)
> >
> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> > index 08133de21fdd63..711b5fb26d47d4 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> > @@ -1931,10 +1931,12 @@ int amdgpu_ras_save_bad_pages(struct amdgpu_device *adev)
> >       struct ras_err_handler_data *data;
> >       struct amdgpu_ras_eeprom_control *control;
> >       int save_count;
> > +     int ret = 0;
> >
> >       if (!con || !con->eh_data)
> >               return 0;
> >
> > +     mutex_lock(&con->recovery_lock);
> >       control = &con->eeprom_control;
> >       data = con->eh_data;
> >       save_count = data->count - control->ras_num_recs;
> 
> [Tao] Since recovery_lock is dedicated to protecting eh_data, can we unlock it
> here?
> 
> > @@ -1944,13 +1946,16 @@ int amdgpu_ras_save_bad_pages(struct amdgpu_device *adev)
> >                                            &data->bps[control->ras_num_recs],
> >                                            save_count)) {
> >                       dev_err(adev->dev, "Failed to save EEPROM table data!");
> > -                     return -EIO;
> > +                     ret = -EIO;
> > +                     goto out;
> >               }
> >
> >               dev_info(adev->dev, "Saved %d pages to EEPROM table.\n", save_count);
> >       }
> >
> > -     return 0;
> > +out:
> > +     mutex_unlock(&con->recovery_lock);
> > +     return ret;
> >  }
> >
> >  /*
> > --
> > 2.17.1
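
For readers following the thread, the review point (take recovery_lock only while reading eh_data, then drop it before the slow EEPROM write) can be sketched in user space with pthreads. This is only an illustration under stated assumptions, not the amdgpu code: struct bad_page_list, struct record_store, store_append() and save_new_pages() are hypothetical stand-ins, and the snapshot copy taken under the lock is an extra step added for the sketch that the patches in this thread do not perform.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>
#include <string.h>

#define MAX_RECS 16

/* Hypothetical stand-in for the bad page list; not an amdgpu structure. */
struct bad_page_list {
	pthread_mutex_t lock;          /* plays the role of recovery_lock */
	int count;                     /* entries currently in pages[] */
	unsigned long pages[MAX_RECS];
};

/* Hypothetical stand-in for the persistent table; touched only by the saver. */
struct record_store {
	int num_recs;                  /* entries already persisted */
	unsigned long recs[MAX_RECS];
};

/* Slow persistence step; called without the list lock held. */
static int store_append(struct record_store *st, const unsigned long *src, int n)
{
	if (n < 0 || st->num_recs + n > MAX_RECS)
		return -1;
	memcpy(&st->recs[st->num_recs], src, (size_t)n * sizeof(*src));
	st->num_recs += n;
	return 0;
}

/* Returns the number of newly saved entries, 0 if none, -1 on failure. */
static int save_new_pages(struct bad_page_list *list, struct record_store *st)
{
	unsigned long snapshot[MAX_RECS];
	int save_count;

	/* Short critical section: compute the delta and copy the new entries. */
	pthread_mutex_lock(&list->lock);
	assert(list->count <= MAX_RECS);
	save_count = list->count - st->num_recs;
	if (save_count > 0)
		memcpy(snapshot, &list->pages[st->num_recs],
		       (size_t)save_count * sizeof(snapshot[0]));
	pthread_mutex_unlock(&list->lock);

	/* Only new entries are saved, and the write happens outside the lock. */
	if (save_count <= 0)
		return 0;
	if (store_append(st, snapshot, save_count))
		return -1;
	return save_count;
}
```

With no early return inside the locked region, the error path needs no goto/unlock label, which is the simplification the second revision of the patch also ends up with.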