Xinhui submitted this patch instead, which should address the same
issue: "drm/amdgpu: Remove one duplicated ef removal"
Alex, can you pick up that patch for drm-fixes for 5.19, if it's not too
late?
Thanks,
Felix
On 2022-07-18 10:58, Mike Lothian wrote:
Is this likely to land before 5.19 final? It's been nearly 2 weeks
since I said if fixed an issue I was seeing
https://gitlab.freedesktop.org/drm/amd/-/issues/2074
On Fri, 8 Jul 2022 at 10:05, Christian König
<ckoenig.leichtzumerken@xxxxxxxxx> wrote:
Hi guys,
well the practice to remove all fences by adding a NULL exclusive fence
was pretty much illegal in the first place because this also removes
kernel internal fences which can lead to freeing up memory which is
still accessed.
I've just didn't noticed that this was used by the KFD code as well
otherwise I would have pushed to clean that up much earlier.
Regards,
Christian.
Am 08.07.22 um 03:08 schrieb Pan, Xinhui:
[AMD Official Use Only - General]
Felix,
Shared fences depend on exclusive fence, so add a new exclusive fence, say NULL would also remove all shared fences. That works before 5.18 . 😉
From 5.18, adding a new exclusive fence(the write usage fence) did not remove any shared fences(the read usage fence). So that is broken.
And I also try the debug_evictions parameter. No unexpected eviction shows anyway.
I did a quick check and found amdgpu implement BO release notify and it will remove kfd ef on pt/pd BO. So we don’t need this duplicated ef removal. The interesting thing is that is done by patch f4a3c42b5c52("drm/amdgpu: Remove kfd eviction fence before release bo (v2)") which is from me 😉 I totally forgot it.
So I would make a new patch to remove this duplicated ef removal.
-----Original Message-----
From: Kuehling, Felix <Felix.Kuehling@xxxxxxx>
Sent: Thursday, July 7, 2022 11:47 PM
To: Christian König <ckoenig.leichtzumerken@xxxxxxxxx>; Pan, Xinhui <Xinhui.Pan@xxxxxxx>; amd-gfx@xxxxxxxxxxxxxxxxxxxxx
Cc: Deucher, Alexander <Alexander.Deucher@xxxxxxx>; Koenig, Christian <Christian.Koenig@xxxxxxx>
Subject: Re: [PATCH] drm/amdgpu: Fix a NULL pointer of fence
Am 2022-07-07 um 05:54 schrieb Christian König:
Am 07.07.22 um 11:50 schrieb xinhui pan:
Fence is accessed by dma_resv_add_fence() now.
Use amdgpu_amdkfd_remove_eviction_fence instead.
Signed-off-by: xinhui pan <xinhui.pan@xxxxxxx>
---
drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
index 0036c9e405af..1e25c400ce4f 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
@@ -1558,10 +1558,10 @@ void amdgpu_amdkfd_gpuvm_destroy_cb(struct
amdgpu_device *adev,
if (!process_info)
return;
-
/* Release eviction fence from PD */
amdgpu_bo_reserve(pd, false);
- amdgpu_bo_fence(pd, NULL, false);
+ amdgpu_amdkfd_remove_eviction_fence(pd,
+ process_info->eviction_fence);
Good catch as well, but Felix needs to take a look at this.
This is weird. We used amdgpu_bo_fence(pd, NULL, false) here, which would have removed an exclusive fence. But as far as I can tell we added the fence as a shared fence in init_kfd_vm and amdgpu_amdkfd_gpuvm_restore_process_bos. So this probably never worked as intended.
You could try if this is really needed. Just remove the eviction fence removal. Then enable eviction debugging with
echo Y > /sys/module/amdgpu/parameters/debug_evictions
Run some simple tests and check the kernel log to see if process termination is causing any unexpected evictions.
Regards,
Felix
Regards,
Christian.
amdgpu_bo_unreserve(pd);
/* Update process info */