Re: [PATCH] drm/amdgpu: don't increase UMC RAS UE count if no new bad page

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 





On 2/14/2023 11:42 AM, Zhou1, Tao wrote:


-----Original Message-----
From: Lazar, Lijo <Lijo.Lazar@xxxxxxx>
Sent: Tuesday, February 14, 2023 12:55 PM
To: Zhou1, Tao <Tao.Zhou1@xxxxxxx>; amd-gfx@xxxxxxxxxxxxxxxxxxxxx; Zhang,
Hawking <Hawking.Zhang@xxxxxxx>; Yang, Stanley
<Stanley.Yang@xxxxxxx>; Chai, Thomas <YiPeng.Chai@xxxxxxx>; Li, Candice
<Candice.Li@xxxxxxx>
Subject: Re: [PATCH] drm/amdgpu: don't increase UMC RAS UE count if no new
bad page



On 2/14/2023 7:56 AM, Zhou1, Tao wrote:


-----Original Message-----
From: Lazar, Lijo <Lijo.Lazar@xxxxxxx>
Sent: Monday, February 13, 2023 8:38 PM
To: Zhou1, Tao <Tao.Zhou1@xxxxxxx>; amd-gfx@xxxxxxxxxxxxxxxxxxxxx;
Zhang, Hawking <Hawking.Zhang@xxxxxxx>; Yang, Stanley
<Stanley.Yang@xxxxxxx>; Chai, Thomas <YiPeng.Chai@xxxxxxx>; Li,
Candice <Candice.Li@xxxxxxx>
Subject: Re: [PATCH] drm/amdgpu: don't increase UMC RAS UE count if
no new bad page



On 2/10/2023 2:15 PM, Tao Zhou wrote:
If a UMC bad page is reserved but not freed by an application, the
application may trigger uncorrectable error repeatly by accessing the page.


There is amdgpu_ras_check_bad_page which checks if address is already
part of an existing bad page. Can't that be used?

Thanks,
Lijo

[Tao] amdgpu_ras_check_bad_page is already called in
amdgpu_ras_add_bad_pages, this patch just makes use of the result of
amdgpu_ras_check_bad_page.


In the patch, below two are called after error count is set to 0.
	amdgpu_ras_save_bad_pages
	amdgpu_dpm_send_hbm_bad_pages_num

Instead of that, just check if it's an existing badpage which is repeatedly
accessed and proceed directly to the next step (reset if
specified)

	if (amdgpu_ras_check_bad_page(adev, address))
		set error count to 0;
		goto reset_logic;

Thanks,
Lijo

[Tao] 1. amdgpu_ras_check_bad_page checks only one page, but one ue can generate 16 or more pages.

In this case, it makes sense to do as per the patch. Thanks for the explanation.

Thanks,
Lijo

2. if no new bad page is found, amdgpu_ras_save_bad_pages will do nothing and ras_num_recs won't increase.
3. gpu reset logic should follow the old way even ue count is set to 0.

This patch only set ue count to 0 if there is no new bad page, but other logic has no change.



Signed-off-by: Tao Zhou <tao.zhou1@xxxxxxx>
---
    drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c | 9 ++++++++-
    drivers/gpu/drm/amd/amdgpu/amdgpu_umc.c | 6 +++++-
    2 files changed, 13 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
index e85c4689ce2c..eafe01a24349 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
@@ -2049,7 +2049,7 @@ int amdgpu_ras_add_bad_pages(struct
amdgpu_device *adev,
    {
    	struct amdgpu_ras *con = amdgpu_ras_get_context(adev);
    	struct ras_err_handler_data *data;
-	int ret = 0;
+	int ret = 0, old_cnt;
    	uint32_t i;

    	if (!con || !con->eh_data || !bps || pages <= 0) @@ -2060,6
+2060,8 @@ int amdgpu_ras_add_bad_pages(struct amdgpu_device *adev,
    	if (!data)
    		goto out;

+	old_cnt = data->count;
+
    	for (i = 0; i < pages; i++) {
    		if (amdgpu_ras_check_bad_page_unlock(con,
    			bps[i].retired_page << AMDGPU_GPU_PAGE_SHIFT))
@@ -2079,6
+2081,11 @@ int amdgpu_ras_add_bad_pages(struct amdgpu_device
*adev,
    		data->count++;
    		data->space_left--;
    	}
+
+	/* all pages have been reserved before, no new bad page */
+	if (old_cnt == data->count)
+		ret = -EEXIST;
+
    out:
    	mutex_unlock(&con->recovery_lock);

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_umc.c
b/drivers/gpu/drm/amd/amdgpu/amdgpu_umc.c
index 1c7fcb4f2380..772c431e4065 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_umc.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_umc.c
@@ -145,8 +145,12 @@ static int
amdgpu_umc_do_page_retirement(struct
amdgpu_device *adev,

    		if ((amdgpu_bad_page_threshold != 0) &&
    			err_data->err_addr_cnt) {
-			amdgpu_ras_add_bad_pages(adev, err_data->err_addr,
+			ret = amdgpu_ras_add_bad_pages(adev, err_data-
err_addr,
    						err_data->err_addr_cnt);
+			/* if no new bad page is found, no need to increase ue
count */
+			if (ret == -EEXIST)
+				err_data->ue_count = 0;
+
    			amdgpu_ras_save_bad_pages(adev);

    			amdgpu_dpm_send_hbm_bad_pages_num(adev,
con->eeprom_control.ras_num_recs);



[Index of Archives]     [Linux USB Devel]     [Linux Audio Users]     [Yosemite News]     [Linux Kernel]     [Linux SCSI]

  Powered by Linux