On 2/14/2023 7:56 AM, Zhou1, Tao wrote:
-----Original Message-----
From: Lazar, Lijo <Lijo.Lazar@xxxxxxx>
Sent: Monday, February 13, 2023 8:38 PM
To: Zhou1, Tao <Tao.Zhou1@xxxxxxx>; amd-gfx@xxxxxxxxxxxxxxxxxxxxx; Zhang,
Hawking <Hawking.Zhang@xxxxxxx>; Yang, Stanley
<Stanley.Yang@xxxxxxx>; Chai, Thomas <YiPeng.Chai@xxxxxxx>; Li, Candice
<Candice.Li@xxxxxxx>
Subject: Re: [PATCH] drm/amdgpu: don't increase UMC RAS UE count if no new
bad page
On 2/10/2023 2:15 PM, Tao Zhou wrote:
If a UMC bad page is reserved but not freed by an application, the
application may trigger uncorrectable error repeatly by accessing the page.
There is amdgpu_ras_check_bad_page which checks if address is already part of
an existing bad page. Can't that be used?
Thanks,
Lijo
[Tao] amdgpu_ras_check_bad_page is already called in amdgpu_ras_add_bad_pages, this patch just makes use of the result of amdgpu_ras_check_bad_page.
In the patch, below two are called after error count is set to 0.
amdgpu_ras_save_bad_pages
amdgpu_dpm_send_hbm_bad_pages_num
Instead of that, just check if it's an existing badpage which is
repeatedly accessed and proceed directly to the next step (reset if
specified)
if (amdgpu_ras_check_bad_page(adev, address))
set error count to 0;
goto reset_logic;
Thanks,
Lijo
Signed-off-by: Tao Zhou <tao.zhou1@xxxxxxx>
---
drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c | 9 ++++++++-
drivers/gpu/drm/amd/amdgpu/amdgpu_umc.c | 6 +++++-
2 files changed, 13 insertions(+), 2 deletions(-)
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
index e85c4689ce2c..eafe01a24349 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
@@ -2049,7 +2049,7 @@ int amdgpu_ras_add_bad_pages(struct
amdgpu_device *adev,
{
struct amdgpu_ras *con = amdgpu_ras_get_context(adev);
struct ras_err_handler_data *data;
- int ret = 0;
+ int ret = 0, old_cnt;
uint32_t i;
if (!con || !con->eh_data || !bps || pages <= 0) @@ -2060,6 +2060,8
@@ int amdgpu_ras_add_bad_pages(struct amdgpu_device *adev,
if (!data)
goto out;
+ old_cnt = data->count;
+
for (i = 0; i < pages; i++) {
if (amdgpu_ras_check_bad_page_unlock(con,
bps[i].retired_page << AMDGPU_GPU_PAGE_SHIFT))
@@ -2079,6
+2081,11 @@ int amdgpu_ras_add_bad_pages(struct amdgpu_device *adev,
data->count++;
data->space_left--;
}
+
+ /* all pages have been reserved before, no new bad page */
+ if (old_cnt == data->count)
+ ret = -EEXIST;
+
out:
mutex_unlock(&con->recovery_lock);
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_umc.c
b/drivers/gpu/drm/amd/amdgpu/amdgpu_umc.c
index 1c7fcb4f2380..772c431e4065 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_umc.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_umc.c
@@ -145,8 +145,12 @@ static int amdgpu_umc_do_page_retirement(struct
amdgpu_device *adev,
if ((amdgpu_bad_page_threshold != 0) &&
err_data->err_addr_cnt) {
- amdgpu_ras_add_bad_pages(adev, err_data->err_addr,
+ ret = amdgpu_ras_add_bad_pages(adev, err_data-
err_addr,
err_data->err_addr_cnt);
+ /* if no new bad page is found, no need to increase ue
count */
+ if (ret == -EEXIST)
+ err_data->ue_count = 0;
+
amdgpu_ras_save_bad_pages(adev);
amdgpu_dpm_send_hbm_bad_pages_num(adev,
con->eeprom_control.ras_num_recs);