Re: [PATCH 1/5] drm/amdgpu: add bad page count threshold in module parameter

Also note that module parameters are global.  If you change the
parameter, it changes it for all GPUs in the system.  That may not be
what the customer wants.

Alex
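
To illustrate the point about global scope, here is a minimal userspace sketch in C. It is not amdgpu code: struct gpu, gpu_probe(), gpu_should_retire() and the per-board default values are hypothetical names and numbers chosen for illustration. Only the -1 = auto convention and the "exceeds the threshold" comparison come from the patch description. The sketch assumes the driver resolves the effective threshold once at probe time.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for a module parameter: one variable shared by every device.
 * -1 means "auto", mirroring the default in the patch. */
static int bad_page_threshold = -1;

/* Hypothetical per-device state; NOT the real struct amdgpu_device. */
struct gpu {
    const char *name;
    int default_threshold; /* hypothetical per-board "typical" value */
    int threshold;         /* effective limit, resolved at probe time */
};

/* Resolve the effective threshold once, when the device is probed.
 * Every device reads the same global, so a non-negative value
 * overrides the per-board default on all of them. */
void gpu_probe(struct gpu *g)
{
    assert(g != NULL);
    g->threshold = (bad_page_threshold < 0) ? g->default_threshold
                                            : bad_page_threshold;
}

/* Returns 1 if the bad page count exceeds the device's threshold. */
int gpu_should_retire(const struct gpu *g, int bad_pages)
{
    assert(g != NULL);
    return bad_pages > g->threshold;
}
```

With the parameter left at -1, two boards with different defaults keep their own limits after gpu_probe(). Once the parameter is set to any non-negative value, both boards end up with the same limit, which is the behaviour described above.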

On Thu, Jul 23, 2020 at 9:10 AM Christian König
<ckoenig.leichtzumerken@xxxxxxxxx> wrote:
>
> I agree with Guchun as well.
>
> When you have a dynamic module parameter and change the bad page
> threshold, the GPU might suddenly stop working.
>
> That is not a good idea as far as I can see.
>
> Regards,
> Christian.
>
> On 23.07.20 at 05:47, Chen, Guchun wrote:
> > [AMD Public Use]
> >
> > Hi Dennis,
> >
> > To be honest, I considered your suggestion when I started the design. My thinking is that, in practice, the bad page threshold is a static configuration that should be set once at probe time.
> > So a module parameter is an ideal choice for this.
> >
> > Regards,
> > Guchun
> >
> > -----Original Message-----
> > From: Li, Dennis <Dennis.Li@xxxxxxx>
> > Sent: Thursday, July 23, 2020 8:32 AM
> > To: Chen, Guchun <Guchun.Chen@xxxxxxx>; amd-gfx@xxxxxxxxxxxxxxxxxxxxx; Deucher, Alexander <Alexander.Deucher@xxxxxxx>; Zhang, Hawking <Hawking.Zhang@xxxxxxx>; Yang, Stanley <Stanley.Yang@xxxxxxx>; Zhou1, Tao <Tao.Zhou1@xxxxxxx>; Clements, John <John.Clements@xxxxxxx>
> > Subject: RE: [PATCH 1/5] drm/amdgpu: add bad page count threshold in module parameter
> >
> > [AMD Official Use Only - Internal Distribution Only]
> >
> > Hi, Guchun,
> >        It would be better to let users change amdgpu_bad_page_threshold through sysfs, so that they do not need to reboot the system when they want to change their strategy.
> >
> > Best Regards
> > Dennis Li
> > -----Original Message-----
> > From: Chen, Guchun <Guchun.Chen@xxxxxxx>
> > Sent: Wednesday, July 22, 2020 11:14 AM
> > To: amd-gfx@xxxxxxxxxxxxxxxxxxxxx; Deucher, Alexander <Alexander.Deucher@xxxxxxx>; Zhang, Hawking <Hawking.Zhang@xxxxxxx>; Li, Dennis <Dennis.Li@xxxxxxx>; Yang, Stanley <Stanley.Yang@xxxxxxx>; Zhou1, Tao <Tao.Zhou1@xxxxxxx>; Clements, John <John.Clements@xxxxxxx>
> > Cc: Chen, Guchun <Guchun.Chen@xxxxxxx>
> > Subject: [PATCH 1/5] drm/amdgpu: add bad page count threshold in module parameter
> >
> > bad_page_threshold can be specified to detect and retire a bad GPU if the number of faulty pages exceeds it.
> >
> > When it is -1, RAS will use a typical bad page failure value.
> >
> > Signed-off-by: Guchun Chen <guchun.chen@xxxxxxx>
> > ---
> >   drivers/gpu/drm/amd/amdgpu/amdgpu.h     |  1 +
> >   drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c | 11 +++++++++++
> >   2 files changed, 12 insertions(+)
> >
> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> > index 06bfb8658dec..bb83ffb5e26a 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> > @@ -181,6 +181,7 @@ extern uint amdgpu_dm_abm_level;
> >   extern struct amdgpu_mgpu_info mgpu_info;
> >   extern int amdgpu_ras_enable;
> >   extern uint amdgpu_ras_mask;
> > +extern int amdgpu_bad_page_threshold;
> >   extern int amdgpu_async_gfx_ring;
> >   extern int amdgpu_mcbp;
> >   extern int amdgpu_discovery;
> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> > index d28b95f721c4..f99671101746 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> > @@ -161,6 +161,7 @@ struct amdgpu_mgpu_info mgpu_info = {
> >   };
> >   int amdgpu_ras_enable = -1;
> >   uint amdgpu_ras_mask = 0xffffffff;
> > +int amdgpu_bad_page_threshold = -1;
> >
> >   /**
> >    * DOC: vramlimit (int)
> > @@ -801,6 +802,16 @@ module_param_named(tmz, amdgpu_tmz, int, 0444);
> >   MODULE_PARM_DESC(reset_method, "GPU reset method (-1 = auto (default), 0 = legacy, 1 = mode0, 2 = mode1, 3 = mode2, 4 = baco)");
> >   module_param_named(reset_method, amdgpu_reset_method, int, 0444);
> >
> > +/**
> > + * DOC: bad_page_threshold (int)
> > + * Bad page threshold configuration is driven by RMA(Return Merchandise
> > + * Authorization) policy, which is to specify the threshold value of faulty
> > + * pages detected by ECC, which may result in GPU's retirement if total
> > + * faulty pages by ECC exceed threshold value.
> > + */
> > +MODULE_PARM_DESC(bad_page_threshold, "Bad page threshold(-1 = auto(default typical value))");
> > +module_param_named(bad_page_threshold, amdgpu_bad_page_threshold, int, 0444);
> > +
> >   static const struct pci_device_id pciidlist[] = {
> >   #ifdef CONFIG_DRM_AMDGPU_SI
> >       {0x1002, 0x6780, PCI_ANY_ID, PCI_ANY_ID, 0, 0, CHIP_TAHITI},
> > --
> > 2.17.1
> > _______________________________________________
> > amd-gfx mailing list
> > amd-gfx@xxxxxxxxxxxxxxxxxxxxx
> > https://lists.freedesktop.org/mailman/listinfo/amd-gfx
>



