[AMD Official Use Only - General] Felix, Shared fences depend on exclusive fence, so add a new exclusive fence, say NULL would also remove all shared fences. That works before 5.18 . 😉 >From 5.18, adding a new exclusive fence(the write usage fence) did not remove any shared fences(the read usage fence). So that is broken. And I also try the debug_evictions parameter. No unexpected eviction shows anyway. I did a quick check and found amdgpu implement BO release notify and it will remove kfd ef on pt/pd BO. So we don’t need this duplicated ef removal. The interesting thing is that is done by patch f4a3c42b5c52("drm/amdgpu: Remove kfd eviction fence before release bo (v2)") which is from me 😉 I totally forgot it. So I would make a new patch to remove this duplicated ef removal. -----Original Message----- From: Kuehling, Felix <Felix.Kuehling@xxxxxxx> Sent: Thursday, July 7, 2022 11:47 PM To: Christian König <ckoenig.leichtzumerken@xxxxxxxxx>; Pan, Xinhui <Xinhui.Pan@xxxxxxx>; amd-gfx@xxxxxxxxxxxxxxxxxxxxx Cc: Deucher, Alexander <Alexander.Deucher@xxxxxxx>; Koenig, Christian <Christian.Koenig@xxxxxxx> Subject: Re: [PATCH] drm/amdgpu: Fix a NULL pointer of fence Am 2022-07-07 um 05:54 schrieb Christian König: > Am 07.07.22 um 11:50 schrieb xinhui pan: >> Fence is accessed by dma_resv_add_fence() now. >> Use amdgpu_amdkfd_remove_eviction_fence instead. >> >> Signed-off-by: xinhui pan <xinhui.pan@xxxxxxx> >> --- >> drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c | 4 ++-- >> 1 file changed, 2 insertions(+), 2 deletions(-) >> >> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c >> b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c >> index 0036c9e405af..1e25c400ce4f 100644 >> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c >> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c >> @@ -1558,10 +1558,10 @@ void amdgpu_amdkfd_gpuvm_destroy_cb(struct >> amdgpu_device *adev, >> if (!process_info) >> return; >> - >> /* Release eviction fence from PD */ >> amdgpu_bo_reserve(pd, false); >> - amdgpu_bo_fence(pd, NULL, false); >> + amdgpu_amdkfd_remove_eviction_fence(pd, >> + process_info->eviction_fence); > > Good catch as well, but Felix needs to take a look at this. This is weird. We used amdgpu_bo_fence(pd, NULL, false) here, which would have removed an exclusive fence. But as far as I can tell we added the fence as a shared fence in init_kfd_vm and amdgpu_amdkfd_gpuvm_restore_process_bos. So this probably never worked as intended. You could try if this is really needed. Just remove the eviction fence removal. Then enable eviction debugging with echo Y > /sys/module/amdgpu/parameters/debug_evictions Run some simple tests and check the kernel log to see if process termination is causing any unexpected evictions. Regards, Felix > > Regards, > Christian. > >> amdgpu_bo_unreserve(pd); >> /* Update process info */ >