[AMD Official Use Only]

> -----Original Message-----
> From: Kuehling, Felix <Felix.Kuehling@xxxxxxx>
> Sent: Tuesday, October 19, 2021 2:13 PM
> To: Russell, Kent <Kent.Russell@xxxxxxx>; amd-gfx@xxxxxxxxxxxxxxxxxxxxx
> Cc: Tuikov, Luben <Luben.Tuikov@xxxxxxx>; Joshi, Mukul <Mukul.Joshi@xxxxxxx>
> Subject: Re: [PATCH 3/4] drm/amdgpu: Add kernel parameter for ignoring bad page threshold
>
>
> On 2021-10-19 at 1:50 p.m., Kent Russell wrote:
> > When a GPU hits the bad_page_threshold, it will not be initialized by
> > the amdgpu driver. This means that the table cannot be cleared, nor can
> > information gathering be performed (getting serial number, BDF, etc).
> > Add an override called ignore_bad_page_threshold that can be set to true
> > to still initialize the GPU, even when the bad page threshold has been
> > reached.
> Do you really need a new parameter for this? Wouldn't it be enough to
> set bad_page_threshold to the VRAM size? You could use a new special
> value (e.g. bad_page_threshold=-2) for that.

Ah, interesting. That could definitely work here. I hadn't thought about
co-opting another variable. We already check -1, so why not -2? Great
insight. Thanks!

Kent

>
> Regards,
>   Felix
>
>
> >
> > Cc: Luben Tuikov <luben.tuikov@xxxxxxx>
> > Cc: Mukul Joshi <Mukul.Joshi@xxxxxxx>
> > Signed-off-by: Kent Russell <kent.russell@xxxxxxx>
> > ---
> >  drivers/gpu/drm/amd/amdgpu/amdgpu.h     |  1 +
> >  drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c | 13 +++++++++++++
> >  2 files changed, 14 insertions(+)
> >
> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> > index d58e37fd01f4..b85b67a88a3d 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> > @@ -205,6 +205,7 @@ extern struct amdgpu_mgpu_info mgpu_info;
> >  extern int amdgpu_ras_enable;
> >  extern uint amdgpu_ras_mask;
> >  extern int amdgpu_bad_page_threshold;
> > +extern bool amdgpu_ignore_bad_page_threshold;
> >  extern struct amdgpu_watchdog_timer amdgpu_watchdog_timer;
> >  extern int amdgpu_async_gfx_ring;
> >  extern int amdgpu_mcbp;
> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> > index 96bd63aeeddd..3e9a7b072888 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> > @@ -189,6 +189,7 @@ struct amdgpu_mgpu_info mgpu_info = {
> >  int amdgpu_ras_enable = -1;
> >  uint amdgpu_ras_mask = 0xffffffff;
> >  int amdgpu_bad_page_threshold = -1;
> > +bool amdgpu_ignore_bad_page_threshold;
> >  struct amdgpu_watchdog_timer amdgpu_watchdog_timer = {
> >  	.timeout_fatal_disable = false,
> >  	.period = 0x0, /* default to 0x0 (timeout disable) */
> > @@ -880,6 +881,18 @@ module_param_named(reset_method, amdgpu_reset_method, int, 0444);
> >  MODULE_PARM_DESC(bad_page_threshold, "Bad page threshold(-1 = auto(default value), 0 = disable bad page retirement)");
> >  module_param_named(bad_page_threshold, amdgpu_bad_page_threshold, int, 0444);
> >
> > +/**
> > + * DOC: ignore_bad_page_threshold (bool) Bad page threshold specifies
> > + * the threshold value of faulty pages detected by RAS ECC. Once the
> > + * threshold is hit, the GPU will not be initialized. Use this parameter
> > + * to ignore the bad page threshold so that information gathering can
> > + * still be performed. This also allows for booting the GPU to clear
> > + * the RAS EEPROM table.
> > + */
> > +
> > +MODULE_PARM_DESC(ignore_bad_page_threshold, "Ignore bad page threshold (false = respect bad page threshold (default value)");
> > +module_param_named(ignore_bad_page_threshold, amdgpu_ignore_bad_page_threshold, bool, 0644);
> > +
> >  MODULE_PARM_DESC(num_kcq, "number of kernel compute queue user want to setup (8 if set to greater than 8 or less than 0, only affect gfx 8+)");
> >  module_param_named(num_kcq, amdgpu_num_kcq, int, 0444);
> >
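
For reference, the bad_page_threshold=-2 idea from the reply above could be
sketched roughly as follows. This is a standalone illustration under stated
assumptions, not driver code: the macro names, both helper functions, and the
auto_default argument are hypothetical, and treating 0 as "never refuse
initialization" is an assumption based only on the existing parameter
description ("0 = disable bad page retirement").

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical names for the special values of bad_page_threshold.
 * -1 already exists ("auto"); -2 is the value floated in this thread. */
#define THRESHOLD_AUTO   (-1)
#define THRESHOLD_IGNORE (-2)

/* Hypothetical helper: turn the module parameter into an effective
 * page-count threshold. auto_default stands in for whatever value the
 * driver would pick for -1; how that is computed is not shown here. */
static uint64_t effective_threshold(int param, uint64_t vram_pages,
				    uint64_t auto_default)
{
	assert(param >= THRESHOLD_IGNORE);	/* only -2, -1, and >= 0 are defined */

	if (param == THRESHOLD_IGNORE)
		return vram_pages;	/* threshold equal to the VRAM size */
	if (param == THRESHOLD_AUTO)
		return auto_default;
	return (uint64_t)param;
}

/* Hypothetical helper: true when initialization would be refused.
 * Assumption: 0 (retirement disabled) never refuses initialization. */
static bool threshold_reached(int param, uint64_t bad_pages,
			      uint64_t vram_pages, uint64_t auto_default)
{
	if (param == 0)
		return false;
	return bad_pages >= effective_threshold(param, vram_pages, auto_default);
}
```

With -2 mapped to the VRAM size, the existing comparison stays untouched and
no second parameter is needed; the threshold simply becomes unreachable unless
every page is bad.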