[AMD Official Use Only - General] > -----Original Message----- > From: Zhang, Hawking <Hawking.Zhang@xxxxxxx> > Sent: Friday, October 21, 2022 12:15 PM > To: Zhou1, Tao <Tao.Zhou1@xxxxxxx>; amd-gfx@xxxxxxxxxxxxxxxxxxxxx; Yang, > Stanley <Stanley.Yang@xxxxxxx>; Chai, Thomas <YiPeng.Chai@xxxxxxx>; Li, > Candice <Candice.Li@xxxxxxx> > Subject: RE: [PATCH 1/4] drm/amdgpu: add RAS page retirement functions for > MCA > > [AMD Official Use Only - General] > > Re - whether need to do gpu reset is determined by unmap queue status, so > reset parameter can't be dropped > > + if (adev->gmc.xgmi.connected_to_cpu) { > + ret = amdgpu_umc_poison_handler_mca(adev, > ras_error_status, reset); > > I think amdgpu_umc_poison_handler_mca is fallback handler specific for MCA > platform, right? > > I noticed there is platform check already in amdgpu_umc_poison_handler with > the reset flag. so driver already knows whether the reset is needed, right. > That's why I think "ras_error_status", "reset" are all not necessary. You can > either call the followings directly by checking connected_to_cpu && reset, > > + kgd2kfd_set_sram_ecc_flag(adev->kfd.dev); > + amdgpu_ras_reset_gpu(adev); > > Or still provide a wrapper like amdgpu_umc_poison_handler_mca for above two > calls. > > The latter seems redundant as well. I mean, we don't need to maintain a specific > API for poison handling fallback on MCA platform - Aldebaran is the last SOC > that supports this A + A RAS design. I can confirm we'll move to a new design > going forward. > > Regards, > Hawking [Tao] adding amdgpu_umc_poison_handler_mca is for better extension, although it only calls gpu reset right now. But since A + A RAS design will change dramatically, I'll remove amdgpu_umc_poison_handler_mca as you suggested. > > -----Original Message----- > From: Zhou1, Tao <Tao.Zhou1@xxxxxxx> > Sent: Friday, October 21, 2022 10:54 > To: Zhang, Hawking <Hawking.Zhang@xxxxxxx>; amd- > gfx@xxxxxxxxxxxxxxxxxxxxx; Yang, Stanley <Stanley.Yang@xxxxxxx>; Chai, > Thomas <YiPeng.Chai@xxxxxxx>; Li, Candice <Candice.Li@xxxxxxx> > Subject: RE: [PATCH 1/4] drm/amdgpu: add RAS page retirement functions for > MCA > > [AMD Official Use Only - General] > > > > -----Original Message----- > > From: Zhang, Hawking <Hawking.Zhang@xxxxxxx> > > Sent: Thursday, October 20, 2022 5:13 PM > > To: Zhou1, Tao <Tao.Zhou1@xxxxxxx>; amd-gfx@xxxxxxxxxxxxxxxxxxxxx; > > Yang, Stanley <Stanley.Yang@xxxxxxx>; Chai, Thomas > > <YiPeng.Chai@xxxxxxx>; Li, Candice <Candice.Li@xxxxxxx> > > Subject: RE: [PATCH 1/4] drm/amdgpu: add RAS page retirement functions > > for MCA > > > > [AMD Official Use Only - General] > > > > +static int amdgpu_umc_poison_handler_mca(struct amdgpu_device *adev, > > + struct ras_err_data *err_data, bool reset) > > > > > > + if (adev->gmc.xgmi.connected_to_cpu) { > > + ret = amdgpu_umc_poison_handler_mca(adev, > > ras_error_status, reset); > > > > The input parameters "reset" and "err_data" can be dropped since > > amdgpu_umc_poison_handler_mca is dedicated to MCA platform > > [Tao] whether need to do gpu reset is determined by unmap queue status, so > reset parameter can't be dropped. > For "err_data", it can be removed currently, but I'm afraid we may need it on > other ASICs in the future. > > > > > Regards, > > Hawking > > > > -----Original Message----- > > From: Zhou1, Tao <Tao.Zhou1@xxxxxxx> > > Sent: Wednesday, October 19, 2022 16:12 > > To: amd-gfx@xxxxxxxxxxxxxxxxxxxxx; Zhang, Hawking > > <Hawking.Zhang@xxxxxxx>; Yang, Stanley <Stanley.Yang@xxxxxxx>; Chai, > > Thomas <YiPeng.Chai@xxxxxxx>; Li, Candice <Candice.Li@xxxxxxx> > > Cc: Zhou1, Tao <Tao.Zhou1@xxxxxxx> > > Subject: [PATCH 1/4] drm/amdgpu: add RAS page retirement functions for > > MCA > > > > Define page retirement functions for MCA platform. > > > > v2: remove page retirement handling from MCA poison handler, > > let MCA notifier do page retirement. > > > > Signed-off-by: Tao Zhou <tao.zhou1@xxxxxxx> > > --- > > drivers/gpu/drm/amd/amdgpu/amdgpu_umc.c | 67 > > +++++++++++++++++++++++++ > drivers/gpu/drm/amd/amdgpu/amdgpu_umc.h > > | 2 + > > 2 files changed, 69 insertions(+) > > > > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_umc.c > > b/drivers/gpu/drm/amd/amdgpu/amdgpu_umc.c > > index aad3c8b4c810..9494fa14db9a 100644 > > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_umc.c > > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_umc.c > > @@ -22,6 +22,73 @@ > > */ > > > > #include "amdgpu.h" > > +#include "umc_v6_7.h" > > + > > +static int amdgpu_umc_convert_error_address(struct amdgpu_device *adev, > > + struct ras_err_data *err_data, uint64_t > > err_addr, > > + uint32_t ch_inst, uint32_t umc_inst) { > > + switch (adev->ip_versions[UMC_HWIP][0]) { > > + case IP_VERSION(6, 7, 0): > > + umc_v6_7_convert_error_address(adev, > > + err_data, err_addr, ch_inst, umc_inst); > > + break; > > + default: > > + dev_warn(adev->dev, > > + "UMC address to Physical address translation is not > > supported\n"); > > + return AMDGPU_RAS_FAIL; > > + } > > + > > + return AMDGPU_RAS_SUCCESS; > > +} > > + > > +int amdgpu_umc_page_retirement_mca(struct amdgpu_device *adev, > > + uint64_t err_addr, uint32_t ch_inst, uint32_t umc_inst) > > { > > + struct ras_err_data err_data = {0, 0, 0, NULL}; > > + int ret = AMDGPU_RAS_FAIL; > > + > > + err_data.err_addr = > > + kcalloc(adev->umc.max_ras_err_cnt_per_query, > > + sizeof(struct eeprom_table_record), GFP_KERNEL); > > + if (!err_data.err_addr) { > > + dev_warn(adev->dev, > > + "Failed to alloc memory for umc error record in MCA > > notifier!\n"); > > + return AMDGPU_RAS_FAIL; > > + } > > + > > + /* > > + * Translate UMC channel address to Physical address > > + */ > > + ret = amdgpu_umc_convert_error_address(adev, &err_data, err_addr, > > + ch_inst, umc_inst); > > + if (ret) > > + goto out; > > + > > + if (amdgpu_bad_page_threshold != 0) { > > + amdgpu_ras_add_bad_pages(adev, err_data.err_addr, > > + err_data.err_addr_cnt); > > + amdgpu_ras_save_bad_pages(adev); > > + } > > + > > +out: > > + kfree(err_data.err_addr); > > + return ret; > > +} > > + > > +static int amdgpu_umc_poison_handler_mca(struct amdgpu_device *adev, > > + struct ras_err_data *err_data, bool reset) { > > + /* MCA poison handler is only responsible for GPU reset, > > + * let MCA notifier do page retirement. > > + */ > > + if (reset) { > > + kgd2kfd_set_sram_ecc_flag(adev->kfd.dev); > > + amdgpu_ras_reset_gpu(adev); > > + } > > + > > + return AMDGPU_RAS_SUCCESS; > > +} > > > > static int amdgpu_umc_do_page_retirement(struct amdgpu_device *adev, > > void *ras_error_status, > > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_umc.h > > b/drivers/gpu/drm/amd/amdgpu/amdgpu_umc.h > > index 3629d8f292ef..659a10de29c9 100644 > > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_umc.h > > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_umc.h > > @@ -98,4 +98,6 @@ void amdgpu_umc_fill_error_record(struct > > ras_err_data *err_data, int amdgpu_umc_process_ras_data_cb(struct > amdgpu_device *adev, > > void *ras_error_status, > > struct amdgpu_iv_entry *entry); > > +int amdgpu_umc_page_retirement_mca(struct amdgpu_device *adev, > > + uint64_t err_addr, uint32_t ch_inst, uint32_t umc_inst); > > #endif > > -- > > 2.35.1