On 2021-10-20 17:50, Felix Kuehling wrote: > On 2021-10-20 12:35 p.m., Kent Russell wrote: >> Currently dmesg doesn't warn when the number of bad pages approaches the >> threshold for page retirement. WARN when the number of bad pages >> is at 90% or greater for easier checks and planning, instead of waiting >> until the GPU is full of bad pages >> >> Cc: Luben Tuikov <luben.tuikov@xxxxxxx> >> Cc: Mukul Joshi <Mukul.Joshi@xxxxxxx> >> Signed-off-by: Kent Russell <kent.russell@xxxxxxx> >> --- >> drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c | 17 +++++++++++++++++ >> 1 file changed, 17 insertions(+) >> >> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c >> index f4c05ff4b26c..1ede0f0d6f55 100644 >> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c >> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c >> @@ -1071,12 +1071,29 @@ int amdgpu_ras_eeprom_init(struct amdgpu_ras_eeprom_control *control, >> control->ras_fri = RAS_OFFSET_TO_INDEX(control, hdr->first_rec_offset); >> >> if (hdr->header == RAS_TABLE_HDR_VAL) { >> + int threshold = 0; > ras->bad_page_cnt_threshold is uint32_t. I'd recommend using the same > type. Also add an empty line after the declaration to avoid a checkpatch > warning. > > >> DRM_DEBUG_DRIVER("Found existing EEPROM table with %d records", >> control->ras_num_recs); >> res = __verify_ras_table_checksum(control); >> if (res) >> DRM_ERROR("RAS table incorrect checksum or error:%d\n", >> res); >> + >> + /* threshold = 0 means that page retirement is disabled, while >> + * threshold = -1 means default behaviour >> + */ >> + if (amdgpu_bad_page_threshold == -1) >> + threshold = ras->bad_page_cnt_threshold; >> + else if (amdgpu_bad_page_threshold > 0) >> + threshold = amdgpu_bad_page_threshold; >> + >> + /* Since multiplcation is transitive, a = 9b/10 is the same >> + * as 10a = 9b. Use this for our 90% limit to avoid rounding >> + */ >> + if (threshold > 0 && ((control->ras_num_recs * 10) >= (threshold * 9))) > Not sure how big these values can get, but you may need to cast to > (uint64_t) before the multiplications to avoid overflows. Alternatively > you could use (control->ras_num_recs / 9 >= threshold / 10). It'll > round, but never overflow. I sincerely hope that by the time those values overflow multiplication by 10, AI has taken over the planet. :-) Avoiding rounding is preferable, since we deal with integers, and for small pages... could we get 0s after division? (if the page limit is 1 :-) ) (I think squashing numbers less than 10 to 0 is a bad idea, so as not to get false positives here on small numbers.) Even for a 32-bit word size: can we be within 3 bits of 2^32? A value of 2^29? That's a lot of pages! Actually the number of pages of VRAM is *a lot smaller* than the number of pages we fit in a typical EEPROM we put with our parts. I think we're safe. > > >> + DRM_WARN("RAS records:%u exceeds 90%% of threshold:%d", > Nitpick: I'd add space after the two colons for readability. The > threshold should use %u if you make it uint32_t. This can never be negative. Aha, so that's exactly what I like "word:%q word:%q word:%q", so when I read it in the log, my eyes scan over it, on the spaces in between the word blocks. I prefer no spaces after the colon, so as to make scanning blocks (one has to squint to see it). Regards, Luben > > Regards, > Felix > > >> + control->ras_num_recs, >> + threshold); >> } else if (hdr->header == RAS_TABLE_HDR_BAD && >> amdgpu_bad_page_threshold != 0) { >> res = __verify_ras_table_checksum(control);