[AMD Official Use Only - AMD Internal Distribution Only]

-----------------
Best Regards,
Thomas

-----Original Message-----
From: Zhou1, Tao <Tao.Zhou1@xxxxxxx>
Sent: Thursday, July 4, 2024 11:40 AM
To: Chai, Thomas <YiPeng.Chai@xxxxxxx>; amd-gfx@xxxxxxxxxxxxxxxxxxxxx
Cc: Zhang, Hawking <Hawking.Zhang@xxxxxxx>; Li, Candice <Candice.Li@xxxxxxx>; Wang, Yang(Kevin) <KevinYang.Wang@xxxxxxx>; Yang, Stanley <Stanley.Yang@xxxxxxx>
Subject: RE: [PATCH 2/2] drm/amdgpu: timely save bad pages to eeprom after gpu ras reset is complete

> -----Original Message-----
> From: Chai, Thomas <YiPeng.Chai@xxxxxxx>
> Sent: Wednesday, July 3, 2024 4:41 PM
> To: amd-gfx@xxxxxxxxxxxxxxxxxxxxx
> Cc: Zhang, Hawking <Hawking.Zhang@xxxxxxx>; Zhou1, Tao <Tao.Zhou1@xxxxxxx>;
> Li, Candice <Candice.Li@xxxxxxx>; Wang, Yang(Kevin) <KevinYang.Wang@xxxxxxx>;
> Yang, Stanley <Stanley.Yang@xxxxxxx>; Chai, Thomas <YiPeng.Chai@xxxxxxx>
> Subject: [PATCH 2/2] drm/amdgpu: timely save bad pages to eeprom after
> gpu ras reset is complete
>
> The problem case is as follows:
> 1. GPU A triggers a gpu ras reset, and GPU A drives
>    GPU B to also perform a gpu ras reset.
> 2. After gpu B ras reset started, gpu B queried a DE
>    data. Since the DE data was queried in the ras reset
>    thread instead of the page retirement thread, bad
>    page retirement work would not be triggered. Then
>    even if all gpu resets are completed, the bad pages
>    will be cached in RAM until GPU B's bad page retirement
>    work is triggered again and then saved to eeprom.
>
[Tao] can we add this description to code comment?
[Thomas] OK
>
> This patch can save the bad pages to eeprom in time after gpu ras
> reset is complete.
>
> Signed-off-by: YiPeng Chai <YiPeng.Chai@xxxxxxx>
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c | 14 +++++++++++++-
>  drivers/gpu/drm/amd/amdgpu/umc_v12_0.c  |  6 ++++++
>  2 files changed, 19 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> index 1b6f5b26957b..b6e047a354a2 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> @@ -2844,8 +2844,20 @@ static void amdgpu_ras_do_page_retirement(struct work_struct *work)
>       struct ras_err_data err_data;
>       unsigned long err_cnt;
>
> -     if (amdgpu_in_reset(adev) || amdgpu_ras_in_recovery(adev))
> +     if (amdgpu_in_reset(adev) || amdgpu_ras_in_recovery(adev)) {
> +             int ret;
> +
> +             mutex_lock(&con->umc_ecc_log.lock);
> +             ret = radix_tree_tagged(&con->umc_ecc_log.de_page_tree,
> +                                     UMC_ECC_NEW_DETECTED_TAG);
> +             mutex_unlock(&con->umc_ecc_log.lock);
> +
> +             /* If gpu reset is not completed, schedule delayed work again */
> +             if (ret)
> +                     schedule_delayed_work(&con->page_retirement_dwork,
> +                             msecs_to_jiffies(AMDGPU_RAS_RETIRE_PAGE_INTERVAL * 3));
>
[Tao] this section of code can be put in a function to make code reusable.
[Thomas] OK
>               return;
> +     }
>
>       amdgpu_ras_error_data_init(&err_data);
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/umc_v12_0.c b/drivers/gpu/drm/amd/amdgpu/umc_v12_0.c
> index 0faa21d8a7b4..7bdba5532adb 100644
> --- a/drivers/gpu/drm/amd/amdgpu/umc_v12_0.c
> +++ b/drivers/gpu/drm/amd/amdgpu/umc_v12_0.c
> @@ -29,6 +29,7 @@
>  #include "mp/mp_13_0_6_sh_mask.h"
>
>  #define MAX_ECC_NUM_PER_RETIREMENT  32
> +#define DELAYED_TIME_FOR_GPU_RESET  1000  //ms
>
>  static inline uint64_t get_umc_v12_0_reg_offset(struct amdgpu_device *adev,
>                                             uint32_t node_inst,
> @@ -568,6 +569,11 @@ static int umc_v12_0_update_ecc_status(struct amdgpu_device *adev,
>
>       con->umc_ecc_log.de_queried_count++;
>
> +     /* Try to retire the bad pages detected after gpu ras reset started */
> +     if (amdgpu_ras_in_recovery(adev))
> +             schedule_delayed_work(&con->page_retirement_dwork,
> +                     msecs_to_jiffies(DELAYED_TIME_FOR_GPU_RESET));
> +
>       return 0;
>  }
>
> --
> 2.34.1
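
For reference, the refactor requested in the second review comment (moving the "re-arm the delayed work if new bad pages are still pending" check into its own function) could look roughly like the sketch below. This is a userspace C model only, not driver code. The names struct ras_model, ras_model_resched_if_pending() and ras_model_do_page_retirement() are hypothetical; the counter new_de_pages stands in for the radix-tree entries tagged UMC_ECC_NEW_DETECTED_TAG; pending_delay_ms stands in for schedule_delayed_work(); and the 100 ms base interval is an assumed value for AMDGPU_RAS_RETIRE_PAGE_INTERVAL.

```c
#include <stdbool.h>

/* Assumed stand-in for AMDGPU_RAS_RETIRE_PAGE_INTERVAL (ms). */
#define MODEL_RETIRE_PAGE_INTERVAL_MS 100

/* Hypothetical userspace model of the RAS state touched by the patch. */
struct ras_model {
	bool in_reset;             /* models amdgpu_in_reset() */
	bool in_recovery;          /* models amdgpu_ras_in_recovery() */
	unsigned int new_de_pages; /* models entries tagged as newly detected */
	unsigned int saved_pages;  /* pages already written to the "EEPROM" */
	int pending_delay_ms;      /* -1: no delayed work queued */
};

/*
 * Reusable helper: queue the retirement work again only if newly
 * detected bad pages are still waiting. Returns true if work was queued.
 */
bool ras_model_resched_if_pending(struct ras_model *m, int delay_ms)
{
	if (m->new_de_pages == 0)
		return false;

	m->pending_delay_ms = delay_ms;
	return true;
}

/* Models the body of the page retirement work item. */
void ras_model_do_page_retirement(struct ras_model *m)
{
	/* The work item is running now, so nothing is queued any more. */
	m->pending_delay_ms = -1;

	/* Reset still in progress: do not touch the EEPROM, retry later. */
	if (m->in_reset || m->in_recovery) {
		ras_model_resched_if_pending(m, MODEL_RETIRE_PAGE_INTERVAL_MS * 3);
		return;
	}

	/* Reset finished: flush the cached bad pages to the "EEPROM". */
	m->saved_pages += m->new_de_pages;
	m->new_de_pages = 0;
}
```

In this model, running the work item while in_recovery is set leaves the pages cached and re-queues the work; running it again after recovery ends flushes them. The same helper could then be called from both sites the patch touches, with the delay passed in as a parameter.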