[AMD Official Use Only - General] For the unlocking, I have tested on both nv21 and nv31, the unlock/lock paring looks not break. On asic <gfx11, the (!adev->gfx.is_poweron) always true, this parameter is introduced from GFX11. On gfx11, in the reset (suspend then resume) process, after suspend, gfx poweron right after smu resumed successfully. The (!adev->gfx.is_poweron) is always false when trylock the reset_domin->sem. Not return ahead in gfx11 and continue gpu tlb flush in IP specific (gmc v11) callback. Then unlock after gpu tlb flush: void amdgpu_gmc_flush_gpu_tlb(struct amdgpu_device *adev, uint32_t vmid, uint32_t vmhub, uint32_t flush_type) { if (!down_read_trylock(&adev->reset_domain->sem) && !adev->gfx.is_poweron) //lock return; ... adev->gmc.gmc_funcs->flush_gpu_tlb(adev, vmid, vmhub, flush_type); up_read(&adev->reset_domain->sem); //unlock return; } Thanks, Feifei -----Original Message----- From: Xu, Feifei Sent: Tuesday, October 10, 2023 5:44 PM To: Christian König <ckoenig.leichtzumerken@xxxxxxxxx>; Wang, Yang(Kevin) <KevinYang.Wang@xxxxxxx>; amd-gfx@xxxxxxxxxxxxxxxxxxxxx Cc: Koenig, Christian <Christian.Koenig@xxxxxxx>; Zhang, Hawking <Hawking.Zhang@xxxxxxx> Subject: RE: [PATCH] drm/amdgpu:Check gfx poweron when skip flush_gpu_tlb >> Then a TLB flush shouldn't be necessary on reset. A reset implies that the TLB is cleared as well. Hmm, in current implementation, when we say a reset implied that the TLB is cleared, assume that the TLB clear is purely hardware action. There's no gpu tlb flush initiated by software/driver after suspend. While in some asics of gfx11 (like nv31), gpu tlb need to be flushed by software/driver after smu resume successfully intentionally. Without the gpu tlb flush on nv31, S3 or reset will be break with gfx page fault. >> First of all the patch is broken because you only handle the locking, but not the unlocking part. For the unlocking part, realized that you and Kevin are correct. Lock/unlock not paried. Thanks, Feifei -----Original Message----- From: Christian König <ckoenig.leichtzumerken@xxxxxxxxx> Sent: Monday, October 9, 2023 4:52 PM To: Xu, Feifei <Feifei.Xu@xxxxxxx>; Wang, Yang(Kevin) <KevinYang.Wang@xxxxxxx>; amd-gfx@xxxxxxxxxxxxxxxxxxxxx Cc: Koenig, Christian <Christian.Koenig@xxxxxxx>; Zhang, Hawking <Hawking.Zhang@xxxxxxx> Subject: Re: [PATCH] drm/amdgpu:Check gfx poweron when skip flush_gpu_tlb Am 09.10.23 um 03:50 schrieb Xu, Feifei: > [AMD Official Use Only - General] > > Hi, > >>> Based on your description, the above code should use "||" instead of >>> "&&", > && is to add more restriction here. To avoid skipping necessary TLB flush by return. > For Asics < GFX11, !adev->gfx.is_poweron is always true (this > paremeter is intrudoced from GFX11), only depends on reset_domain->sem; For Asics = GFX11, !adev->gfx.is_poweron might be false (which gfx might already poweron in the reset), this will make the if () not ture, return will not be executed, thus flush TLB. First of all the patch is broken because you only handle the locking, but not the unlocking part. Then a TLB flush shouldn't be necessary on reset. A reset implies that the TLB is cleared as well. We discussed the possibility to avoid that, but this is not supposed to be happening at the moment. Regards, Christian. > >>> And after merging code into one line may result in the lock not being released if the lock can be acquired success. > If !adev->gfx.is_poweron is true, the reset_domin->sem will not be down_read_trylock, thus could avoid this deadlock. > > Thanks, > Feifei > > -----Original Message----- > From: Wang, Yang(Kevin) <KevinYang.Wang@xxxxxxx> > Sent: Sunday, October 8, 2023 9:36 PM > To: Xu, Feifei <Feifei.Xu@xxxxxxx>; amd-gfx@xxxxxxxxxxxxxxxxxxxxx > Cc: Xu, Feifei <Feifei.Xu@xxxxxxx>; Xu, Feifei <Feifei.Xu@xxxxxxx>; > Koenig, Christian <Christian.Koenig@xxxxxxx>; Zhang, Hawking > <Hawking.Zhang@xxxxxxx> > Subject: RE: [PATCH] drm/amdgpu:Check gfx poweron when skip > flush_gpu_tlb > > > -----Original Message----- > From: amd-gfx <amd-gfx-bounces@xxxxxxxxxxxxxxxxxxxxx> On Behalf Of > Feifei Xu > Sent: Sunday, October 8, 2023 6:07 PM > To: amd-gfx@xxxxxxxxxxxxxxxxxxxxx > Cc: Xu, Feifei <Feifei.Xu@xxxxxxx>; Xu, Feifei <Feifei.Xu@xxxxxxx>; > Koenig, Christian <Christian.Koenig@xxxxxxx>; Zhang, Hawking > <Hawking.Zhang@xxxxxxx> > Subject: [PATCH] drm/amdgpu:Check gfx poweron when skip flush_gpu_tlb > > To fix the gpu recovery failed on GFX11 with gfxhub pagefault, flush gpu tlb after reset on GFX11. > Gfxhub tlb flush need check if adev->gfx.is_poweron set. > > Fixes: d0c860f33553 ("drm/amdgpu: rework lock handling for flush_tlb > v2") > > Signed-off-by: Feifei Xu <Feifei.Xu@xxxxxxx> > --- > drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c | 3 ++- > 1 file changed, 2 insertions(+), 1 deletion(-) > > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c > b/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c > index 2f9bb86edd71..048d32edee88 100644 > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c > @@ -611,8 +611,9 @@ void amdgpu_gmc_flush_gpu_tlb(struct amdgpu_device *adev, uint32_t vmid, > /* > * A GPU reset should flush all TLBs anyway, so no need to do > * this while one is ongoing. > + * For GFX11, gfxhub flush check if adev->gfx.is_poweron is set. > */ > - if (!down_read_trylock(&adev->reset_domain->sem)) > + if (!down_read_trylock(&adev->reset_domain->sem) && > +!adev->gfx.is_poweron) > return; > > [Kevin]: > Based on your description, the above code should use "||" instead of > "&&", And after merging code into one line may result in the lock not being released if the lock can be acquired success. > > Best Regards, > Kevin > > if (adev->gmc.flush_tlb_needs_extra_type_2) > -- > 2.34.1 >