> -----Original Message----- > From: Lazar, Lijo <Lijo.Lazar@xxxxxxx> > Sent: Tuesday, February 14, 2023 12:55 PM > To: Zhou1, Tao <Tao.Zhou1@xxxxxxx>; amd-gfx@xxxxxxxxxxxxxxxxxxxxx; Zhang, > Hawking <Hawking.Zhang@xxxxxxx>; Yang, Stanley > <Stanley.Yang@xxxxxxx>; Chai, Thomas <YiPeng.Chai@xxxxxxx>; Li, Candice > <Candice.Li@xxxxxxx> > Subject: Re: [PATCH] drm/amdgpu: don't increase UMC RAS UE count if no new > bad page > > > > On 2/14/2023 7:56 AM, Zhou1, Tao wrote: > > > > > >> -----Original Message----- > >> From: Lazar, Lijo <Lijo.Lazar@xxxxxxx> > >> Sent: Monday, February 13, 2023 8:38 PM > >> To: Zhou1, Tao <Tao.Zhou1@xxxxxxx>; amd-gfx@xxxxxxxxxxxxxxxxxxxxx; > >> Zhang, Hawking <Hawking.Zhang@xxxxxxx>; Yang, Stanley > >> <Stanley.Yang@xxxxxxx>; Chai, Thomas <YiPeng.Chai@xxxxxxx>; Li, > >> Candice <Candice.Li@xxxxxxx> > >> Subject: Re: [PATCH] drm/amdgpu: don't increase UMC RAS UE count if > >> no new bad page > >> > >> > >> > >> On 2/10/2023 2:15 PM, Tao Zhou wrote: > >>> If a UMC bad page is reserved but not freed by an application, the > >>> application may trigger uncorrectable error repeatly by accessing the page. > >>> > >> > >> There is amdgpu_ras_check_bad_page which checks if address is already > >> part of an existing bad page. Can't that be used? > >> > >> Thanks, > >> Lijo > > > > [Tao] amdgpu_ras_check_bad_page is already called in > amdgpu_ras_add_bad_pages, this patch just makes use of the result of > amdgpu_ras_check_bad_page. > > > > In the patch, below two are called after error count is set to 0. > amdgpu_ras_save_bad_pages > amdgpu_dpm_send_hbm_bad_pages_num > > Instead of that, just check if it's an existing badpage which is repeatedly > accessed and proceed directly to the next step (reset if > specified) > > if (amdgpu_ras_check_bad_page(adev, address)) > set error count to 0; > goto reset_logic; > > Thanks, > Lijo [Tao] 1. amdgpu_ras_check_bad_page checks only one page, but one ue can generate 16 or more pages. 2. if no new bad page is found, amdgpu_ras_save_bad_pages will do nothing and ras_num_recs won't increase. 3. gpu reset logic should follow the old way even ue count is set to 0. This patch only set ue count to 0 if there is no new bad page, but other logic has no change. > > >> > >>> Signed-off-by: Tao Zhou <tao.zhou1@xxxxxxx> > >>> --- > >>> drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c | 9 ++++++++- > >>> drivers/gpu/drm/amd/amdgpu/amdgpu_umc.c | 6 +++++- > >>> 2 files changed, 13 insertions(+), 2 deletions(-) > >>> > >>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c > >>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c > >>> index e85c4689ce2c..eafe01a24349 100644 > >>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c > >>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c > >>> @@ -2049,7 +2049,7 @@ int amdgpu_ras_add_bad_pages(struct > >> amdgpu_device *adev, > >>> { > >>> struct amdgpu_ras *con = amdgpu_ras_get_context(adev); > >>> struct ras_err_handler_data *data; > >>> - int ret = 0; > >>> + int ret = 0, old_cnt; > >>> uint32_t i; > >>> > >>> if (!con || !con->eh_data || !bps || pages <= 0) @@ -2060,6 > >>> +2060,8 @@ int amdgpu_ras_add_bad_pages(struct amdgpu_device *adev, > >>> if (!data) > >>> goto out; > >>> > >>> + old_cnt = data->count; > >>> + > >>> for (i = 0; i < pages; i++) { > >>> if (amdgpu_ras_check_bad_page_unlock(con, > >>> bps[i].retired_page << AMDGPU_GPU_PAGE_SHIFT)) > >> @@ -2079,6 > >>> +2081,11 @@ int amdgpu_ras_add_bad_pages(struct amdgpu_device > *adev, > >>> data->count++; > >>> data->space_left--; > >>> } > >>> + > >>> + /* all pages have been reserved before, no new bad page */ > >>> + if (old_cnt == data->count) > >>> + ret = -EEXIST; > >>> + > >>> out: > >>> mutex_unlock(&con->recovery_lock); > >>> > >>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_umc.c > >>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_umc.c > >>> index 1c7fcb4f2380..772c431e4065 100644 > >>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_umc.c > >>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_umc.c > >>> @@ -145,8 +145,12 @@ static int > amdgpu_umc_do_page_retirement(struct > >>> amdgpu_device *adev, > >>> > >>> if ((amdgpu_bad_page_threshold != 0) && > >>> err_data->err_addr_cnt) { > >>> - amdgpu_ras_add_bad_pages(adev, err_data->err_addr, > >>> + ret = amdgpu_ras_add_bad_pages(adev, err_data- > >>> err_addr, > >>> err_data->err_addr_cnt); > >>> + /* if no new bad page is found, no need to increase ue > >> count */ > >>> + if (ret == -EEXIST) > >>> + err_data->ue_count = 0; > >>> + > >>> amdgpu_ras_save_bad_pages(adev); > >>> > >>> amdgpu_dpm_send_hbm_bad_pages_num(adev, > >>> con->eeprom_control.ras_num_recs);