RE: [PATCH] drm/amdgpu: increase RAS bad page threshold

[AMD Official Use Only - AMD Internal Distribution Only]

Reviewed-by: Hawking Zhang <Hawking.Zhang@xxxxxxx>

Regards,
Hawking
-----Original Message-----
From: amd-gfx <amd-gfx-bounces@xxxxxxxxxxxxxxxxxxxxx> On Behalf Of Tao Zhou
Sent: Thursday, March 6, 2025 14:11
To: amd-gfx@xxxxxxxxxxxxxxxxxxxxx
Cc: Zhou1, Tao <Tao.Zhou1@xxxxxxx>
Subject: [PATCH] drm/amdgpu: increase RAS bad page threshold

Under the default policy, the driver issues an RMA event when the number of bad pages exceeds 8 physical rows, rather than when it reaches 8 physical rows. Do not rely on the configurable threshold parameters in default mode.

Signed-off-by: Tao Zhou <tao.zhou1@xxxxxxx>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c
index ab27cecb5519..09a6f8bc1a5a 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c
@@ -747,7 +747,7 @@ amdgpu_ras_eeprom_update_header(struct amdgpu_ras_eeprom_control *control)
        /* Modify the header if it exceeds.
         */
        if (amdgpu_bad_page_threshold != 0 &&
-           control->ras_num_bad_pages >= ras->bad_page_cnt_threshold) {
+           control->ras_num_bad_pages > ras->bad_page_cnt_threshold) {
                dev_warn(adev->dev,
                        "Saved bad pages %d reaches threshold value %d\n",
                        control->ras_num_bad_pages, ras->bad_page_cnt_threshold);
@@ -806,7 +806,7 @@ amdgpu_ras_eeprom_update_header(struct amdgpu_ras_eeprom_control *control)
         */
        if (amdgpu_bad_page_threshold != 0 &&
            control->tbl_hdr.version == RAS_TABLE_VER_V2_1 &&
-           control->ras_num_bad_pages < ras->bad_page_cnt_threshold)
+           control->ras_num_bad_pages <= ras->bad_page_cnt_threshold)
                control->tbl_rai.health_percent = ((ras->bad_page_cnt_threshold -
                                                   control->ras_num_bad_pages) * 100) /
                                                   ras->bad_page_cnt_threshold;
@@ -1456,7 +1456,7 @@ int amdgpu_ras_eeprom_check(struct amdgpu_ras_eeprom_control *control)
                                res);
                        return -EINVAL;
                }
-               if (ras->bad_page_cnt_threshold > control->ras_num_bad_pages) {
+               if (ras->bad_page_cnt_threshold >= control->ras_num_bad_pages) {
                        /* This means that, the threshold was increased since
                         * the last time the system was booted, and now,
                         * ras->bad_page_cnt_threshold - control->num_recs > 0,
--
2.34.1
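
The boundary change in the three hunks above can be illustrated with a small standalone sketch. This is not the driver code: `rma_old`, `rma_new`, and `health_percent` are hypothetical helpers that only mirror the comparisons in the diff, the threshold of 8 used in the note below is an arbitrary sample value, and the clamp to 0 past the threshold is a choice made for this sketch.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the comparisons in the patch; not amdgpu code. */

/* Old check: the threshold is treated as hit once the count reaches it. */
static bool rma_old(unsigned int bad_pages, unsigned int threshold)
{
	return bad_pages >= threshold;
}

/* New check: the threshold is treated as hit only once the count exceeds it. */
static bool rma_new(unsigned int bad_pages, unsigned int threshold)
{
	return bad_pages > threshold;
}

/*
 * Health percentage using the same arithmetic as the second hunk, guarded by
 * the new "<=" bound so the unsigned subtraction cannot wrap. Counts past the
 * threshold are clamped to 0 (an assumption of this sketch).
 */
static unsigned int health_percent(unsigned int bad_pages,
				   unsigned int threshold)
{
	assert(threshold != 0); /* mirrors the "!= 0" guard in the patch */

	if (bad_pages > threshold)
		return 0;
	return ((threshold - bad_pages) * 100) / threshold;
}
```

With a sample threshold of 8, a count of exactly 8 trips `rma_old` but not `rma_new`, and `health_percent(8, 8)` evaluates to 0 while remaining inside the guarded branch.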




