On 2021-10-19 14:22, Russell, Kent wrote:
> [AMD Official Use Only]
>
>> -----Original Message-----
>> From: Kuehling, Felix <Felix.Kuehling@xxxxxxx>
>> Sent: Tuesday, October 19, 2021 2:09 PM
>> To: Russell, Kent <Kent.Russell@xxxxxxx>; amd-gfx@xxxxxxxxxxxxxxxxxxxxx
>> Cc: Tuikov, Luben <Luben.Tuikov@xxxxxxx>; Joshi, Mukul <Mukul.Joshi@xxxxxxx>
>> Subject: Re: [PATCH 1/4] drm/amdgpu: Warn when bad pages approaches threshold
>>
>> On 2021-10-19 at 1:50 p.m., Kent Russell wrote:
>>> Currently dmesg doesn't warn when the number of bad pages approaches the
>>> threshold for page retirement. WARN when the number of bad pages
>>> is at 90% or greater for easier checks and planning, instead of waiting
>>> until the GPU is full of bad pages
>>>
>>> Cc: Luben Tuikov <luben.tuikov@xxxxxxx>
>>> Cc: Mukul Joshi <Mukul.Joshi@xxxxxxx>
>>> Signed-off-by: Kent Russell <kent.russell@xxxxxxx>
>>> ---
>>>  drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c | 10 ++++++++++
>>>  1 file changed, 10 insertions(+)
>>>
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c
>>> index 98732518543e..8270aad23a06 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c
>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c
>>> @@ -1077,6 +1077,16 @@ int amdgpu_ras_eeprom_init(struct amdgpu_ras_eeprom_control *control,
>>>  		if (res)
>>>  			DRM_ERROR("RAS table incorrect checksum or error:%d\n",
>>>  				  res);
>>> +
>>> +		/* threshold = -1 is automatic, threshold = 0 means that page
>>> +		 * retirement is disabled.
>>> +		 */
>>> +		if (amdgpu_bad_page_threshold > 0 &&
>>> +		    control->ras_num_recs >= 0 &&
>>> +		    control->ras_num_recs >= (amdgpu_bad_page_threshold * 9 / 10))
>>> +			DRM_WARN("RAS records:%u approaching threshold:%d",
>>> +				 control->ras_num_recs,
>>> +				 amdgpu_bad_page_threshold);
>> This won't work for the default setting amdgpu_bad_page_threshold=-1.
>> For this case, you'd have to take the threshold from
>> ras->bad_page_cnt_threshold.
> Yep, completely missed that. Thanks, I'll fix that up.

Please also fix the round-off in the third conditional by avoiding the
integer division. With a = control->ras_num_recs and b = the threshold:

	a >= b * 9/10  <==>  10*a >= 9*b

Then you can also drop the second line, since together with the first:

	b > 0 && 10*a >= 9*b  ==>  10*a >= 9*b > 0  ==>  10*a > 0  ==>  a > 0.

That is, whenever b > 0 && 10*a >= 9*b is true, a is necessarily greater
than 0 as well, so you don't need the middle line of the check.

Also, in your message, say something like:

	DRM_WARN("RAS records:%u approaching a 90%% threshold:%d",
		 control->ras_num_recs,
		 amdgpu_bad_page_threshold);

Regards,
Luben

>
> Kent
>> Regards,
>> Felix
>>
>>
>>>  	} else if (hdr->header == RAS_TABLE_HDR_BAD &&
>>>  		   amdgpu_bad_page_threshold != 0) {
>>>  		res = __verify_ras_table_checksum(control);
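
To make the round-off point concrete, here is a minimal standalone sketch of the two forms of the check. It is not the driver code: both helper names are hypothetical, the threshold is passed in as a parameter rather than read from amdgpu_bad_page_threshold or ras->bad_page_cnt_threshold, and the widening to 64 bits is an extra precaution of this sketch, not something requested in the thread.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper, not part of the patch: true once num_recs has
 * reached at least 90% of threshold, using the multiplied-out form
 * 10*a >= 9*b so no integer division is involved.
 * A non-positive threshold never warns here; for the automatic setting
 * (-1) the caller is assumed to pass the effective threshold instead.
 */
static bool ras_recs_near_threshold(uint32_t num_recs, int threshold)
{
	if (threshold <= 0)
		return false;
	/* Widen to 64 bits so 10 * num_recs cannot wrap around. */
	return 10 * (uint64_t)num_recs >= 9 * (uint64_t)threshold;
}

/* The comparison as posted in the patch, kept only for comparison.
 * threshold * 9 / 10 truncates toward zero, so the warning can fire
 * slightly below 90%.
 */
static bool ras_recs_near_threshold_truncating(uint32_t num_recs, int threshold)
{
	return threshold > 0 && num_recs >= (uint32_t)(threshold * 9 / 10);
}
```

With a threshold of 11, 90% is 9.9. The truncating form compares against 9 and so reports 9 records as "near", while the multiplied-out form first reports at 10 records. The second helper also needs no separate test for num_recs > 0, which matches the argument above.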