I had noticed that most of these RAS messages use the DRM_* macros instead of the dev_* helpers, and I wasn't sure whether there was a reason for that. It's definitely inconsistent:
DRM_ERROR("Partial read for checksum, res:%d\n", res);
DRM_DEBUG_DRIVER("Found existing EEPROM table with %d records",
DRM_ERROR("RAS table incorrect checksum or error:%d\n",
DRM_WARN("RAS records:%u exceeds 90%% of threshold:%d",
DRM_ERROR("RAS Table incorrect checksum or error:%d\n",
dev_info(adev->dev,
"records:%d threshold:%d, resetting RAS table header signature",
dev_err(adev->dev, "RAS records:%d exceed threshold:%d",
dev_warn(adev->dev, "GPU will be initialized due to bad_page_threshold = -2.");
DRM_INFO("Creating a new EEPROM table");
It might be worth making a separate patch to handle those inconsistencies. I agree that the device identifier is useful for this kind of error/warning/info message.
Kent
From: Lazar, Lijo <Lijo.Lazar@xxxxxxx>
Sent: Thursday, October 21, 2021 12:31 PM
To: Russell, Kent <Kent.Russell@xxxxxxx>; amd-gfx@xxxxxxxxxxxxxxxxxxxxx
Cc: Russell, Kent <Kent.Russell@xxxxxxx>; Tuikov, Luben <Luben.Tuikov@xxxxxxx>; Joshi, Mukul <Mukul.Joshi@xxxxxxx>
Subject: Re: [PATCH 1/2] drm/amdgpu: Warn when bad pages approaches 90% threshold
[Public]
Nitpick: I suggest using dev_warn for easy identification of the device.
dmesg doesn't warn when the number of bad pages approaches the
threshold for page retirement. WARN when the number of bad pages
is at 90% or greater for easier checks and planning, instead of waiting
until the GPU is full of bad pages.
Cc: Luben Tuikov <luben.tuikov@xxxxxxx>
Cc: Mukul Joshi <Mukul.Joshi@xxxxxxx>
Signed-off-by: Kent Russell <kent.russell@xxxxxxx>
---
drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c | 6 ++++++
1 file changed, 6 insertions(+)
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c
index f4c05ff4b26c..ce5089216474 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c
@@ -1077,6 +1077,12 @@ int amdgpu_ras_eeprom_init(struct amdgpu_ras_eeprom_control *control,
 		if (res)
 			DRM_ERROR("RAS table incorrect checksum or error:%d\n",
 				  res);
+
+		/* Warn if we are at 90% of the threshold or above */
+		if ((10 * control->ras_num_recs) >= (ras->bad_page_cnt_threshold * 9))
+			DRM_WARN("RAS records:%u exceeds 90%% of threshold:%d",
+				 control->ras_num_recs,
+				 ras->bad_page_cnt_threshold);
 	} else if (hdr->header == RAS_TABLE_HDR_BAD &&
 		   amdgpu_bad_page_threshold != 0) {
 		res = __verify_ras_table_checksum(control);
--
2.25.1
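
For anyone who wants to check the 90% comparison from the hunk above outside the kernel, here is a minimal user-space sketch. The helper name ras_recs_near_threshold and the uint32_t parameter types are assumptions made for illustration and are not part of amdgpu. It performs the same integer test (10 * records >= 9 * threshold), which avoids floating point, and additionally widens the operands to 64 bits so the multiplications cannot wrap; the patch itself does not do that widening.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper for illustration only (not an amdgpu function).
 * Returns true when num_recs is at or above 90% of threshold.
 * Multiplying both sides by 10 turns "recs >= 0.9 * threshold" into a
 * pure integer comparison; the casts to uint64_t keep the products from
 * overflowing 32 bits.
 */
static bool ras_recs_near_threshold(uint32_t num_recs, uint32_t threshold)
{
	return (uint64_t)num_recs * 10 >= (uint64_t)threshold * 9;
}
```

With a threshold of 100, the helper returns false for 89 records and true for 90. Note that, like the comparison in the patch, it also returns true when both values are 0.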