[AMD Official Use Only]

> -----Original Message-----
> From: Kuehling, Felix <Felix.Kuehling@xxxxxxx>
> Sent: Tuesday, October 19, 2021 2:09 PM
> To: Russell, Kent <Kent.Russell@xxxxxxx>; amd-gfx@xxxxxxxxxxxxxxxxxxxxx
> Cc: Tuikov, Luben <Luben.Tuikov@xxxxxxx>; Joshi, Mukul <Mukul.Joshi@xxxxxxx>
> Subject: Re: [PATCH 1/4] drm/amdgpu: Warn when bad pages approaches threshold
>
> On 2021-10-19 at 1:50 p.m., Kent Russell wrote:
> > Currently dmesg doesn't warn when the number of bad pages approaches the
> > threshold for page retirement. WARN when the number of bad pages
> > is at 90% or greater for easier checks and planning, instead of waiting
> > until the GPU is full of bad pages
> >
> > Cc: Luben Tuikov <luben.tuikov@xxxxxxx>
> > Cc: Mukul Joshi <Mukul.Joshi@xxxxxxx>
> > Signed-off-by: Kent Russell <kent.russell@xxxxxxx>
> > ---
> >  drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c | 10 ++++++++++
> >  1 file changed, 10 insertions(+)
> >
> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c
> > index 98732518543e..8270aad23a06 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c
> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c
> > @@ -1077,6 +1077,16 @@ int amdgpu_ras_eeprom_init(struct amdgpu_ras_eeprom_control *control,
> >  		if (res)
> >  			DRM_ERROR("RAS table incorrect checksum or error:%d\n",
> >  				  res);
> > +
> > +		/* threshold = -1 is automatic, threshold = 0 means that page
> > +		 * retirement is disabled.
> > +		 */
> > +		if (amdgpu_bad_page_threshold > 0 &&
> > +		    control->ras_num_recs >= 0 &&
> > +		    control->ras_num_recs >= (amdgpu_bad_page_threshold * 9 / 10))
> > +			DRM_WARN("RAS records:%u approaching threshold:%d",
> > +				 control->ras_num_recs,
> > +				 amdgpu_bad_page_threshold);
>
> This won't work for the default setting amdgpu_bad_page_threshold=-1.
> For this case, you'd have to take the threshold from
> ras->bad_page_cnt_threshold.

Yep, completely missed that. Thanks, I'll fix that up.

 Kent

>
> Regards,
>   Felix
>
>
> >  	} else if (hdr->header == RAS_TABLE_HDR_BAD &&
> >  		   amdgpu_bad_page_threshold != 0) {
> >  		res = __verify_ras_table_checksum(control);
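
For reference, a minimal standalone sketch of the check Felix describes,
not the actual follow-up patch. The helper name and its parameters are
hypothetical: module_param_threshold stands in for
amdgpu_bad_page_threshold, and derived_threshold stands in for
ras->bad_page_cnt_threshold. It assumes, per the comment in the patch,
that -1 means automatic and 0 means page retirement is disabled, and it
keeps the patch's "* 9 / 10" integer arithmetic for the 90% mark.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper (not in the driver): report whether the number of
 * RAS records has reached 90% of the threshold that applies.
 *
 *   module_param_threshold: stand-in for amdgpu_bad_page_threshold
 *                           (-1 = automatic, 0 = retirement disabled)
 *   derived_threshold:      stand-in for ras->bad_page_cnt_threshold
 *   num_recs:               stand-in for control->ras_num_recs
 */
bool ras_records_near_threshold(int module_param_threshold,
				uint32_t derived_threshold,
				uint32_t num_recs)
{
	uint64_t limit;

	/* Page retirement disabled: nothing to warn about. */
	if (module_param_threshold == 0)
		return false;

	/* Automatic setting: fall back to the driver-derived threshold. */
	if (module_param_threshold < 0)
		limit = derived_threshold;
	else
		limit = (uint64_t)module_param_threshold;

	/* A zero limit would make every count "near"; treat it as no limit. */
	if (limit == 0)
		return false;

	/* Same 90% arithmetic as the patch; 64-bit so "* 9" cannot overflow. */
	return num_recs >= limit * 9 / 10;
}
```

With a derived threshold of 100 and the default setting of -1, this
returns true from 90 records upward, which is the case the original
condition (amdgpu_bad_page_threshold > 0) skips entirely.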