>>I guess you are referring to the below corner case >> 1) Place NPS request >> 2) Unload Driver >> 3) Reinstall driver with a different TOS (possible but quite unlikely) >> 4) Driver reload >> 5) Driver checks TOS version first and goes for a reset >> 6) reset_flag of GMC is not set, hence it doesn't refresh the NPS range.
>>I think changing the order in soc15_need_reset_on_init() to check for NPS request before TOS version check will solve this. Yes, I was thinking of reset_flag and tOS
reloading (adev->init_lvl->level set to
AMDGPU_INIT_LEVEL_MINIMAL_XGMI) changing at the same time. And
NPS refresh will be ignored. Though might be likely in debugging
or regression isolation cases which changing driver packaged
with different TOS. And yes making NPS refresh checking before
TOS version checking will solve this. And if we do not return ahead when checking NPS request before
tOS version change in soc15_need_reset_on_init(), we can drop
(adev->init_lvl->level != AMDGPU_INIT_LEVEL_MINIMAL_XGMI)
check in below refresh checking: + refresh = (adev->init_lvl->level != AMDGPU_INIT_LEVEL_MINIMAL_XGMI) && + (adev->gmc.reset_flags & AMDGPU_GMC_INIT_RESET_NPS); refresh = (adev->gmc.reset_flags & AMDGPU_GMC_INIT_RESET_NPS); Thanks Feifei On 9/26/2024 5:01 PM, Xu, Feifei wrote:
+ refresh = (adev->init_lvl->level != AMDGPU_INIT_LEVEL_MINIMAL_XGMI) && + (adev->gmc.reset_flags & AMDGPU_GMC_INIT_RESET_NPS);Is there a corner case that reloading with a different version tos and refreshing nps change co-exist? Thanks, Feifei -----Original Message----- From: amd-gfx <amd-gfx-bounces@xxxxxxxxxxxxxxxxxxxxx> On Behalf Of Lijo Lazar Sent: Tuesday, September 24, 2024 1:57 PM To: amd-gfx@xxxxxxxxxxxxxxxxxxxxx Cc: Zhang, Hawking <Hawking.Zhang@xxxxxxx>; Deucher, Alexander <Alexander.Deucher@xxxxxxx>; Koenig, Christian <Christian.Koenig@xxxxxxx>; Bhardwaj, Rajneesh <Rajneesh.Bhardwaj@xxxxxxx>; Errabolu, Ramesh <Ramesh.Errabolu@xxxxxxx> Subject: [PATCH 6/7] drm/amdgpu: Check gmc requirement for reset on init Add a callback to check if there is any condition detected by GMC block for reset on init. One case is if a pending NPS change request is detected. If reset is done because of NPS switch, refresh NPS info from discovery table. Signed-off-by: Lijo Lazar <lijo.lazar@xxxxxxx> --- drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c | 13 ++++++++++++- drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.h | 5 +++++ drivers/gpu/drm/amd/amdgpu/soc15.c | 2 ++ 3 files changed, 19 insertions(+), 1 deletion(-) diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c index 21f1e65c9dc9..011fe3a847d0 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c @@ -1261,12 +1261,15 @@ int amdgpu_gmc_get_nps_memranges(struct amdgpu_device *adev, struct amdgpu_gmc_memrange *ranges; int range_cnt, ret, i, j; uint32_t nps_type; + bool refresh; if (!mem_ranges) return -EINVAL; + refresh = (adev->init_lvl->level != AMDGPU_INIT_LEVEL_MINIMAL_XGMI) && + (adev->gmc.reset_flags & AMDGPU_GMC_INIT_RESET_NPS); ret = amdgpu_discovery_get_nps_info(adev, &nps_type, &ranges, - &range_cnt, false); + &range_cnt, refresh); if (ret) return ret; @@ -1392,3 +1395,11 @@ void amdgpu_gmc_prepare_nps_mode_change(struct amdgpu_device *adev) adev->dev, "NPS mode change request done, reload driver to complete the change\n"); } + +bool amdgpu_gmc_need_reset_on_init(struct amdgpu_device *adev) { + if (adev->gmc.gmc_funcs->need_reset_on_init) + return adev->gmc.gmc_funcs->need_reset_on_init(adev); + + return false; +} diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.h index b13d6adb5efd..d4cd247fe574 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.h +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.h @@ -78,6 +78,8 @@ enum amdgpu_memory_partition { BIT(AMDGPU_NPS3_PARTITION_MODE) | BIT(AMDGPU_NPS4_PARTITION_MODE) | \ BIT(AMDGPU_NPS6_PARTITION_MODE) | BIT(AMDGPU_NPS8_PARTITION_MODE)) +#define AMDGPU_GMC_INIT_RESET_NPS BIT(0) + /* * GMC page fault information */ @@ -169,6 +171,7 @@ struct amdgpu_gmc_funcs { /* Request NPS mode */ int (*request_mem_partition_mode)(struct amdgpu_device *adev, int nps_mode); + bool (*need_reset_on_init)(struct amdgpu_device *adev); }; struct amdgpu_xgmi_ras { @@ -314,6 +317,7 @@ struct amdgpu_gmc { const struct amdgpu_gmc_funcs *gmc_funcs; enum amdgpu_memory_partition requested_nps_mode; uint32_t supported_nps_modes; + uint32_t reset_flags; struct amdgpu_xgmi xgmi; struct amdgpu_irq_src ecc_irq; @@ -468,5 +472,6 @@ int amdgpu_gmc_get_nps_memranges(struct amdgpu_device *adev, int amdgpu_gmc_request_memory_partition(struct amdgpu_device *adev, int nps_mode); void amdgpu_gmc_prepare_nps_mode_change(struct amdgpu_device *adev); +bool amdgpu_gmc_need_reset_on_init(struct amdgpu_device *adev); #endif diff --git a/drivers/gpu/drm/amd/amdgpu/soc15.c b/drivers/gpu/drm/amd/amdgpu/soc15.c index 619933f252aa..97ca4931a7ef 100644 --- a/drivers/gpu/drm/amd/amdgpu/soc15.c +++ b/drivers/gpu/drm/amd/amdgpu/soc15.c @@ -833,6 +833,8 @@ static bool soc15_need_reset_on_init(struct amdgpu_device *adev) if (amdgpu_psp_tos_reload_needed(adev)) return true; + if (amdgpu_gmc_need_reset_on_init(adev)) + return true; /* Just return false for soc15 GPUs. Reset does not seem to * be necessary. */ -- 2.25.1 |