[AMD Official Use Only - General] Instead of adding another function, why not make pages param as in/out. In = number of pages to be added Out = number of pages actually added. int amdgpu_ras_add_bad_pages(struct amdgpu_device *adev, struct eeprom_table_record *bps, int *pages) Thanks, Lijo -----Original Message----- From: Zhou1, Tao <Tao.Zhou1@xxxxxxx> Sent: Friday, February 17, 2023 9:23 AM To: amd-gfx@xxxxxxxxxxxxxxxxxxxxx; Zhang, Hawking <Hawking.Zhang@xxxxxxx>; Yang, Stanley <Stanley.Yang@xxxxxxx>; Chai, Thomas <YiPeng.Chai@xxxxxxx>; Li, Candice <Candice.Li@xxxxxxx>; Lazar, Lijo <Lijo.Lazar@xxxxxxx> Cc: Zhou1, Tao <Tao.Zhou1@xxxxxxx> Subject: [PATCH 2/2] drm/amdgpu: exclude duplicate pages from UMC RAS UE count If a UMC bad page is reserved but not freed by an application, the application may trigger uncorrectable error repeatly by accessing the page. v2: add specific function to do the check. v3: remove duplicate pages, calculate new added bad page number. Signed-off-by: Tao Zhou <tao.zhou1@xxxxxxx> --- drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c | 23 +++++++++++++++++++++++ drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h | 2 ++ drivers/gpu/drm/amd/amdgpu/amdgpu_umc.c | 2 ++ 3 files changed, 27 insertions(+) diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c index 6e543558386d..777f85f3e5eb 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c @@ -2115,6 +2115,29 @@ int amdgpu_ras_save_bad_pages(struct amdgpu_device *adev) return 0; } +/* Remove duplicate pages, calculate new added bad page number. + * Note: the function should be called between amdgpu_ras_add_bad_pages + * and amdgpu_ras_save_bad_pages. + */ +int amdgpu_ras_umc_new_ue_count(struct amdgpu_device *adev) { + struct amdgpu_ras *con = amdgpu_ras_get_context(adev); + struct ras_err_handler_data *data; + struct amdgpu_ras_eeprom_control *control; + int save_count; + + if (!con || !con->eh_data) + return 0; + + mutex_lock(&con->recovery_lock); + control = &con->eeprom_control; + data = con->eh_data; + save_count = data->count - control->ras_num_recs; + mutex_unlock(&con->recovery_lock); + + return (save_count / adev->umc.retire_unit); } + /* * read error record array in eeprom and reserve enough space for * storing new bad pages diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h index f2ad999993f6..e89c95438a88 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h @@ -549,6 +549,8 @@ int amdgpu_ras_add_bad_pages(struct amdgpu_device *adev, int amdgpu_ras_save_bad_pages(struct amdgpu_device *adev); +int amdgpu_ras_umc_new_ue_count(struct amdgpu_device *adev); + static inline enum ta_ras_block amdgpu_ras_block_to_ta(enum amdgpu_ras_block block) { switch (block) { diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_umc.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_umc.c index 1c7fcb4f2380..45b6be7277dd 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_umc.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_umc.c @@ -147,6 +147,8 @@ static int amdgpu_umc_do_page_retirement(struct amdgpu_device *adev, err_data->err_addr_cnt) { amdgpu_ras_add_bad_pages(adev, err_data->err_addr, err_data->err_addr_cnt); + err_data->ue_count = amdgpu_ras_umc_new_ue_count(adev); + amdgpu_ras_save_bad_pages(adev); amdgpu_dpm_send_hbm_bad_pages_num(adev, con->eeprom_control.ras_num_recs); -- 2.35.1