RE: [PATCH 1/4] drm/amdgpu: add RAS page retirement functions for MCA

"Zhou1, Tao" <Tao.Zhou1@xxxxxxx> · Fri, 21 Oct 2022 06:27:59 +0000

[AMD Official Use Only - General]

> -----Original Message-----
> From: Zhang, Hawking <Hawking.Zhang@xxxxxxx>
> Sent: Friday, October 21, 2022 12:15 PM
> To: Zhou1, Tao <Tao.Zhou1@xxxxxxx>; amd-gfx@xxxxxxxxxxxxxxxxxxxxx; Yang,
> Stanley <Stanley.Yang@xxxxxxx>; Chai, Thomas <YiPeng.Chai@xxxxxxx>; Li,
> Candice <Candice.Li@xxxxxxx>
> Subject: RE: [PATCH 1/4] drm/amdgpu: add RAS page retirement functions for
> MCA
> 
> [AMD Official Use Only - General]
> 
> Re - whether need to do gpu reset is determined by unmap queue status, so
> reset parameter can't be dropped
> 
> +	if (adev->gmc.xgmi.connected_to_cpu) {
> +		ret = amdgpu_umc_poison_handler_mca(adev,
> ras_error_status, reset);
> 
> I think amdgpu_umc_poison_handler_mca is fallback handler specific for MCA
> platform, right?
> 
> I noticed there is platform check already in amdgpu_umc_poison_handler with
> the reset flag. so driver already knows whether the reset is needed, right.
> That's why I think "ras_error_status", "reset" are all not necessary. You can
> either call the followings directly by checking connected_to_cpu && reset,
> 
> +		kgd2kfd_set_sram_ecc_flag(adev->kfd.dev);
> +		amdgpu_ras_reset_gpu(adev);
> 
> Or still provide a wrapper like amdgpu_umc_poison_handler_mca for above two
> calls.
> 
> The latter seems redundant as well. I mean, we don't need to maintain a specific
> API for poison handling fallback on MCA platform - Aldebaran is the last SOC
> that supports this A + A RAS design. I can confirm we'll move to a new design
> going forward.
> 
> Regards,
> Hawking

[Tao] adding amdgpu_umc_poison_handler_mca is for better extension, although it only calls gpu reset right now. But since A + A RAS design will change dramatically, I'll remove amdgpu_umc_poison_handler_mca as you suggested.

> 
> -----Original Message-----
> From: Zhou1, Tao <Tao.Zhou1@xxxxxxx>
> Sent: Friday, October 21, 2022 10:54
> To: Zhang, Hawking <Hawking.Zhang@xxxxxxx>; amd-
> gfx@xxxxxxxxxxxxxxxxxxxxx; Yang, Stanley <Stanley.Yang@xxxxxxx>; Chai,
> Thomas <YiPeng.Chai@xxxxxxx>; Li, Candice <Candice.Li@xxxxxxx>
> Subject: RE: [PATCH 1/4] drm/amdgpu: add RAS page retirement functions for
> MCA
> 
> [AMD Official Use Only - General]
> 
> 
> > -----Original Message-----
> > From: Zhang, Hawking <Hawking.Zhang@xxxxxxx>
> > Sent: Thursday, October 20, 2022 5:13 PM
> > To: Zhou1, Tao <Tao.Zhou1@xxxxxxx>; amd-gfx@xxxxxxxxxxxxxxxxxxxxx;
> > Yang, Stanley <Stanley.Yang@xxxxxxx>; Chai, Thomas
> > <YiPeng.Chai@xxxxxxx>; Li, Candice <Candice.Li@xxxxxxx>
> > Subject: RE: [PATCH 1/4] drm/amdgpu: add RAS page retirement functions
> > for MCA
> >
> > [AMD Official Use Only - General]
> >
> > +static int amdgpu_umc_poison_handler_mca(struct amdgpu_device *adev,
> > +		struct ras_err_data *err_data, bool reset)
> >
> >
> > +	if (adev->gmc.xgmi.connected_to_cpu) {
> > +		ret = amdgpu_umc_poison_handler_mca(adev,
> > ras_error_status, reset);
> >
> > The input parameters "reset" and "err_data" can be dropped since
> > amdgpu_umc_poison_handler_mca is dedicated to MCA platform
> 
> [Tao] whether need to do gpu reset is determined by unmap queue status, so
> reset parameter can't be dropped.
> For "err_data", it can be removed currently, but I'm afraid we may need it on
> other ASICs in the future.
> 
> >
> > Regards,
> > Hawking
> >
> > -----Original Message-----
> > From: Zhou1, Tao <Tao.Zhou1@xxxxxxx>
> > Sent: Wednesday, October 19, 2022 16:12
> > To: amd-gfx@xxxxxxxxxxxxxxxxxxxxx; Zhang, Hawking
> > <Hawking.Zhang@xxxxxxx>; Yang, Stanley <Stanley.Yang@xxxxxxx>; Chai,
> > Thomas <YiPeng.Chai@xxxxxxx>; Li, Candice <Candice.Li@xxxxxxx>
> > Cc: Zhou1, Tao <Tao.Zhou1@xxxxxxx>
> > Subject: [PATCH 1/4] drm/amdgpu: add RAS page retirement functions for
> > MCA
> >
> > Define page retirement functions for MCA platform.
> >
> > v2: remove page retirement handling from MCA poison handler,
> >     let MCA notifier do page retirement.
> >
> > Signed-off-by: Tao Zhou <tao.zhou1@xxxxxxx>
> > ---
> >  drivers/gpu/drm/amd/amdgpu/amdgpu_umc.c | 67
> > +++++++++++++++++++++++++
> drivers/gpu/drm/amd/amdgpu/amdgpu_umc.h
> > |  2 +
> >  2 files changed, 69 insertions(+)
> >
> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_umc.c
> > b/drivers/gpu/drm/amd/amdgpu/amdgpu_umc.c
> > index aad3c8b4c810..9494fa14db9a 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_umc.c
> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_umc.c
> > @@ -22,6 +22,73 @@
> >   */
> >
> >  #include "amdgpu.h"
> > +#include "umc_v6_7.h"
> > +
> > +static int amdgpu_umc_convert_error_address(struct amdgpu_device *adev,
> > +				    struct ras_err_data *err_data, uint64_t
> > err_addr,
> > +				    uint32_t ch_inst, uint32_t umc_inst) {
> > +	switch (adev->ip_versions[UMC_HWIP][0]) {
> > +	case IP_VERSION(6, 7, 0):
> > +		umc_v6_7_convert_error_address(adev,
> > +				err_data, err_addr, ch_inst, umc_inst);
> > +		break;
> > +	default:
> > +		dev_warn(adev->dev,
> > +			 "UMC address to Physical address translation is not
> > supported\n");
> > +		return AMDGPU_RAS_FAIL;
> > +	}
> > +
> > +	return AMDGPU_RAS_SUCCESS;
> > +}
> > +
> > +int amdgpu_umc_page_retirement_mca(struct amdgpu_device *adev,
> > +			uint64_t err_addr, uint32_t ch_inst, uint32_t umc_inst)
> > {
> > +	struct ras_err_data err_data = {0, 0, 0, NULL};
> > +	int ret = AMDGPU_RAS_FAIL;
> > +
> > +	err_data.err_addr =
> > +		kcalloc(adev->umc.max_ras_err_cnt_per_query,
> > +			sizeof(struct eeprom_table_record), GFP_KERNEL);
> > +	if (!err_data.err_addr) {
> > +		dev_warn(adev->dev,
> > +			"Failed to alloc memory for umc error record in MCA
> > notifier!\n");
> > +		return AMDGPU_RAS_FAIL;
> > +	}
> > +
> > +	/*
> > +	 * Translate UMC channel address to Physical address
> > +	 */
> > +	ret = amdgpu_umc_convert_error_address(adev, &err_data, err_addr,
> > +					ch_inst, umc_inst);
> > +	if (ret)
> > +		goto out;
> > +
> > +	if (amdgpu_bad_page_threshold != 0) {
> > +		amdgpu_ras_add_bad_pages(adev, err_data.err_addr,
> > +						err_data.err_addr_cnt);
> > +		amdgpu_ras_save_bad_pages(adev);
> > +	}
> > +
> > +out:
> > +	kfree(err_data.err_addr);
> > +	return ret;
> > +}
> > +
> > +static int amdgpu_umc_poison_handler_mca(struct amdgpu_device *adev,
> > +		struct ras_err_data *err_data, bool reset) {
> > +	/* MCA poison handler is only responsible for GPU reset,
> > +	 * let MCA notifier do page retirement.
> > +	 */
> > +	if (reset) {
> > +		kgd2kfd_set_sram_ecc_flag(adev->kfd.dev);
> > +		amdgpu_ras_reset_gpu(adev);
> > +	}
> > +
> > +	return AMDGPU_RAS_SUCCESS;
> > +}
> >
> >  static int amdgpu_umc_do_page_retirement(struct amdgpu_device *adev,
> >  		void *ras_error_status,
> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_umc.h
> > b/drivers/gpu/drm/amd/amdgpu/amdgpu_umc.h
> > index 3629d8f292ef..659a10de29c9 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_umc.h
> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_umc.h
> > @@ -98,4 +98,6 @@ void amdgpu_umc_fill_error_record(struct
> > ras_err_data *err_data,  int amdgpu_umc_process_ras_data_cb(struct
> amdgpu_device *adev,
> >  		void *ras_error_status,
> >  		struct amdgpu_iv_entry *entry);
> > +int amdgpu_umc_page_retirement_mca(struct amdgpu_device *adev,
> > +			uint64_t err_addr, uint32_t ch_inst, uint32_t umc_inst);
> >  #endif
> > --
> > 2.35.1