On 2021-10-21 11:57, Kent Russell wrote:
> When a GPU hits the bad_page_threshold, it will not be initialized by
> the amdgpu driver. This means that the table cannot be cleared, nor can
> information gathering be performed (getting serial number, BDF, etc).
>
> If the bad_page_threshold kernel parameter is set to -2,
> continue to initialize the GPU, while printing a warning to dmesg that
> this action has been done
>
> Cc: Luben Tuikov <luben.tuikov@xxxxxxx>
> Cc: Mukul Joshi <Mukul.Joshi@xxxxxxx>
> Signed-off-by: Kent Russell <kent.russell@xxxxxxx>
> Acked-by: Felix Kuehling <Felix.Kuehling@xxxxxxx>
> Reviewed-by: Luben Tuikov <luben.tuikov@xxxxxxx>
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu.h            |  1 +
>  drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c        |  2 +-
>  drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c | 12 ++++++++----
>  3 files changed, 10 insertions(+), 5 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> index d58e37fd01f4..b85b67a88a3d 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> @@ -205,6 +205,7 @@ extern struct amdgpu_mgpu_info mgpu_info;
>  extern int amdgpu_ras_enable;
>  extern uint amdgpu_ras_mask;
>  extern int amdgpu_bad_page_threshold;
> +extern bool amdgpu_ignore_bad_page_threshold;
>  extern struct amdgpu_watchdog_timer amdgpu_watchdog_timer;
>  extern int amdgpu_async_gfx_ring;
>  extern int amdgpu_mcbp;
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> index 96bd63aeeddd..eee3cf874e7a 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> @@ -877,7 +877,7 @@ module_param_named(reset_method, amdgpu_reset_method, int, 0444);
>   * result in the GPU entering bad status when the number of total
>   * faulty pages by ECC exceeds the threshold value.
>   */
> -MODULE_PARM_DESC(bad_page_threshold, "Bad page threshold(-1 = auto(default value), 0 = disable bad page retirement)");
> +MODULE_PARM_DESC(bad_page_threshold, "Bad page threshold(-1 = auto(default value), 0 = disable bad page retirement, -2 = ignore bad page threshold)");
>  module_param_named(bad_page_threshold, amdgpu_bad_page_threshold, int, 0444);
>
>  MODULE_PARM_DESC(num_kcq, "number of kernel compute queue user want to setup (8 if set to greater than 8 or less than 0, only affect gfx 8+)");
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c
> index ce5089216474..bd6ed43b0df2 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c
> @@ -1104,11 +1104,15 @@ int amdgpu_ras_eeprom_init(struct amdgpu_ras_eeprom_control *control,
>  			res = amdgpu_ras_eeprom_correct_header_tag(control,
>  								   RAS_TABLE_HDR_VAL);
>  		} else {
> -			*exceed_err_limit = true;
> -			dev_err(adev->dev,
> -				"RAS records:%d exceed threshold:%d, "
> -				"GPU will not be initialized. Replace this GPU or increase the threshold",
> +			dev_err(adev->dev, "RAS records:%d exceed threshold:%d",
>  				control->ras_num_recs, ras->bad_page_cnt_threshold);

I thought this would all go in a single set of patches. I wasn't aware
a singleton patch went in already which changed just this line--this
change was always a part of a patch set.

Regards,
Luben

> +			if (amdgpu_bad_page_threshold == -2) {
> +				dev_warn(adev->dev, "GPU will be initialized due to bad_page_threshold = -2.");
> +				res = 0;
> +			} else {
> +				*exceed_err_limit = true;
> +				dev_err(adev->dev, "GPU will not be initialized. Replace this GPU or increase the threshold.");
> +			}
>  		}
>  	} else {
>  		DRM_INFO("Creating a new EEPROM table");
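
For readers following the last hunk, here is a rough user-space sketch of the branch it adds once the RAS record count has exceeded the threshold. It is not driver code: the names BAD_PAGE_THRESHOLD_IGNORE, struct init_decision and on_threshold_exceeded() are hypothetical and exist only for illustration, and fprintf() stands in for dev_warn()/dev_err(). Only the -2 comparison, res = 0, and setting exceed_err_limit come from the patch itself.

```c
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical name for the -2 value tested in the patch. */
#define BAD_PAGE_THRESHOLD_IGNORE (-2)

/* Hypothetical container for the two outputs the hunk touches. */
struct init_decision {
	int res;               /* 0 means initialization continues */
	bool exceed_err_limit; /* true means the caller refuses to bring up the GPU */
};

/*
 * Models only the branch taken after the record count has already been
 * found to exceed the threshold. bad_page_threshold stands in for the
 * amdgpu_bad_page_threshold module parameter.
 */
static struct init_decision on_threshold_exceeded(int bad_page_threshold)
{
	struct init_decision d = { .res = 0, .exceed_err_limit = false };

	if (bad_page_threshold == BAD_PAGE_THRESHOLD_IGNORE) {
		/* Mirrors the dev_warn() path: warn and keep going. */
		fprintf(stderr, "GPU will be initialized due to bad_page_threshold = -2.\n");
		d.res = 0;
	} else {
		/* Mirrors the dev_err() path: flag the limit as exceeded. */
		d.exceed_err_limit = true;
		fprintf(stderr, "GPU will not be initialized. Replace this GPU or increase the threshold.\n");
	}
	return d;
}
```

Calling on_threshold_exceeded(-2) leaves exceed_err_limit false, so bring-up proceeds; any other value, including the default of -1, sets it to true, which is the pre-patch behaviour.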