RE: [PATCH 1/4] drm/amdgpu: add RAS page retirement functions for MCA

"Zhang, Hawking" <Hawking.Zhang@xxxxxxx> · Fri, 21 Oct 2022 04:15:19 +0000

[AMD Official Use Only - General]

Re - whether need to do gpu reset is determined by unmap queue status, so reset parameter can't be dropped

+	if (adev->gmc.xgmi.connected_to_cpu) {
+		ret = amdgpu_umc_poison_handler_mca(adev, ras_error_status, reset);

I think amdgpu_umc_poison_handler_mca is fallback handler specific for MCA platform, right? 

I noticed there is platform check already in amdgpu_umc_poison_handler with the reset flag. so driver already knows whether the reset is needed, right.
That's why I think "ras_error_status", "reset" are all not necessary. You can either call the followings directly by checking connected_to_cpu && reset,

+		kgd2kfd_set_sram_ecc_flag(adev->kfd.dev);
+		amdgpu_ras_reset_gpu(adev);

Or still provide a wrapper like amdgpu_umc_poison_handler_mca for above two calls.

The latter seems redundant as well. I mean, we don't need to maintain a specific API for poison handling fallback on MCA platform - Aldebaran is
the last SOC that supports this A + A RAS design. I can confirm we'll move to a new design going forward.

Regards,
Hawking

-----Original Message-----
From: Zhou1, Tao <Tao.Zhou1@xxxxxxx> 
Sent: Friday, October 21, 2022 10:54
To: Zhang, Hawking <Hawking.Zhang@xxxxxxx>; amd-gfx@xxxxxxxxxxxxxxxxxxxxx; Yang, Stanley <Stanley.Yang@xxxxxxx>; Chai, Thomas <YiPeng.Chai@xxxxxxx>; Li, Candice <Candice.Li@xxxxxxx>
Subject: RE: [PATCH 1/4] drm/amdgpu: add RAS page retirement functions for MCA

[AMD Official Use Only - General]

> -----Original Message-----
> From: Zhang, Hawking <Hawking.Zhang@xxxxxxx>
> Sent: Thursday, October 20, 2022 5:13 PM
> To: Zhou1, Tao <Tao.Zhou1@xxxxxxx>; amd-gfx@xxxxxxxxxxxxxxxxxxxxx; 
> Yang, Stanley <Stanley.Yang@xxxxxxx>; Chai, Thomas 
> <YiPeng.Chai@xxxxxxx>; Li, Candice <Candice.Li@xxxxxxx>
> Subject: RE: [PATCH 1/4] drm/amdgpu: add RAS page retirement functions 
> for MCA
> 
> [AMD Official Use Only - General]
> 
> +static int amdgpu_umc_poison_handler_mca(struct amdgpu_device *adev,
> +		struct ras_err_data *err_data, bool reset)
> 
> 
> +	if (adev->gmc.xgmi.connected_to_cpu) {
> +		ret = amdgpu_umc_poison_handler_mca(adev,
> ras_error_status, reset);
> 
> The input parameters "reset" and "err_data" can be dropped since 
> amdgpu_umc_poison_handler_mca is dedicated to MCA platform

[Tao] whether need to do gpu reset is determined by unmap queue status, so reset parameter can't be dropped.
For "err_data", it can be removed currently, but I'm afraid we may need it on other ASICs in the future.

> 
> Regards,
> Hawking
> 
> -----Original Message-----
> From: Zhou1, Tao <Tao.Zhou1@xxxxxxx>
> Sent: Wednesday, October 19, 2022 16:12
> To: amd-gfx@xxxxxxxxxxxxxxxxxxxxx; Zhang, Hawking 
> <Hawking.Zhang@xxxxxxx>; Yang, Stanley <Stanley.Yang@xxxxxxx>; Chai, 
> Thomas <YiPeng.Chai@xxxxxxx>; Li, Candice <Candice.Li@xxxxxxx>
> Cc: Zhou1, Tao <Tao.Zhou1@xxxxxxx>
> Subject: [PATCH 1/4] drm/amdgpu: add RAS page retirement functions for 
> MCA
> 
> Define page retirement functions for MCA platform.
> 
> v2: remove page retirement handling from MCA poison handler,
>     let MCA notifier do page retirement.
> 
> Signed-off-by: Tao Zhou <tao.zhou1@xxxxxxx>
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_umc.c | 67
> +++++++++++++++++++++++++  drivers/gpu/drm/amd/amdgpu/amdgpu_umc.h
> |  2 +
>  2 files changed, 69 insertions(+)
> 
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_umc.c
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_umc.c
> index aad3c8b4c810..9494fa14db9a 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_umc.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_umc.c
> @@ -22,6 +22,73 @@
>   */
> 
>  #include "amdgpu.h"
> +#include "umc_v6_7.h"
> +
> +static int amdgpu_umc_convert_error_address(struct amdgpu_device *adev,
> +				    struct ras_err_data *err_data, uint64_t
> err_addr,
> +				    uint32_t ch_inst, uint32_t umc_inst) {
> +	switch (adev->ip_versions[UMC_HWIP][0]) {
> +	case IP_VERSION(6, 7, 0):
> +		umc_v6_7_convert_error_address(adev,
> +				err_data, err_addr, ch_inst, umc_inst);
> +		break;
> +	default:
> +		dev_warn(adev->dev,
> +			 "UMC address to Physical address translation is not
> supported\n");
> +		return AMDGPU_RAS_FAIL;
> +	}
> +
> +	return AMDGPU_RAS_SUCCESS;
> +}
> +
> +int amdgpu_umc_page_retirement_mca(struct amdgpu_device *adev,
> +			uint64_t err_addr, uint32_t ch_inst, uint32_t umc_inst)
> {
> +	struct ras_err_data err_data = {0, 0, 0, NULL};
> +	int ret = AMDGPU_RAS_FAIL;
> +
> +	err_data.err_addr =
> +		kcalloc(adev->umc.max_ras_err_cnt_per_query,
> +			sizeof(struct eeprom_table_record), GFP_KERNEL);
> +	if (!err_data.err_addr) {
> +		dev_warn(adev->dev,
> +			"Failed to alloc memory for umc error record in MCA
> notifier!\n");
> +		return AMDGPU_RAS_FAIL;
> +	}
> +
> +	/*
> +	 * Translate UMC channel address to Physical address
> +	 */
> +	ret = amdgpu_umc_convert_error_address(adev, &err_data, err_addr,
> +					ch_inst, umc_inst);
> +	if (ret)
> +		goto out;
> +
> +	if (amdgpu_bad_page_threshold != 0) {
> +		amdgpu_ras_add_bad_pages(adev, err_data.err_addr,
> +						err_data.err_addr_cnt);
> +		amdgpu_ras_save_bad_pages(adev);
> +	}
> +
> +out:
> +	kfree(err_data.err_addr);
> +	return ret;
> +}
> +
> +static int amdgpu_umc_poison_handler_mca(struct amdgpu_device *adev,
> +		struct ras_err_data *err_data, bool reset) {
> +	/* MCA poison handler is only responsible for GPU reset,
> +	 * let MCA notifier do page retirement.
> +	 */
> +	if (reset) {
> +		kgd2kfd_set_sram_ecc_flag(adev->kfd.dev);
> +		amdgpu_ras_reset_gpu(adev);
> +	}
> +
> +	return AMDGPU_RAS_SUCCESS;
> +}
> 
>  static int amdgpu_umc_do_page_retirement(struct amdgpu_device *adev,
>  		void *ras_error_status,
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_umc.h
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_umc.h
> index 3629d8f292ef..659a10de29c9 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_umc.h
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_umc.h
> @@ -98,4 +98,6 @@ void amdgpu_umc_fill_error_record(struct 
> ras_err_data *err_data,  int amdgpu_umc_process_ras_data_cb(struct amdgpu_device *adev,
>  		void *ras_error_status,
>  		struct amdgpu_iv_entry *entry);
> +int amdgpu_umc_page_retirement_mca(struct amdgpu_device *adev,
> +			uint64_t err_addr, uint32_t ch_inst, uint32_t umc_inst);
>  #endif
> --
> 2.35.1