Re: [PATCH] drm/amdgpu: Fix a NULL pointer of fence

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



Hi guys,

well the practice to remove all fences by adding a NULL exclusive fence was pretty much illegal in the first place because this also removes kernel internal fences which can lead to freeing up memory which is still accessed.

I've just didn't noticed that this was used by the KFD code as well otherwise I would have pushed to clean that up much earlier.

Regards,
Christian.

Am 08.07.22 um 03:08 schrieb Pan, Xinhui:
[AMD Official Use Only - General]

Felix,
Shared fences depend on exclusive fence, so add a new exclusive fence, say NULL would also remove all shared fences. That works before 5.18 . 😉
 From 5.18, adding a new exclusive fence(the write usage fence) did not remove any shared fences(the read usage fence). So that is broken.

And I also try the debug_evictions parameter. No unexpected eviction shows anyway.
I did a quick check and found amdgpu implement BO release notify and it will remove kfd ef on pt/pd BO. So we don’t need this duplicated ef removal. The interesting thing is that is done by patch f4a3c42b5c52("drm/amdgpu: Remove kfd eviction fence before release bo (v2)") which is from me 😉 I totally forgot it.

So I would make a new patch to remove this duplicated ef removal.

-----Original Message-----
From: Kuehling, Felix <Felix.Kuehling@xxxxxxx>
Sent: Thursday, July 7, 2022 11:47 PM
To: Christian König <ckoenig.leichtzumerken@xxxxxxxxx>; Pan, Xinhui <Xinhui.Pan@xxxxxxx>; amd-gfx@xxxxxxxxxxxxxxxxxxxxx
Cc: Deucher, Alexander <Alexander.Deucher@xxxxxxx>; Koenig, Christian <Christian.Koenig@xxxxxxx>
Subject: Re: [PATCH] drm/amdgpu: Fix a NULL pointer of fence

Am 2022-07-07 um 05:54 schrieb Christian König:
Am 07.07.22 um 11:50 schrieb xinhui pan:
Fence is accessed by dma_resv_add_fence() now.
Use amdgpu_amdkfd_remove_eviction_fence instead.

Signed-off-by: xinhui pan <xinhui.pan@xxxxxxx>
---
   drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c | 4 ++--
   1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
index 0036c9e405af..1e25c400ce4f 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
@@ -1558,10 +1558,10 @@ void amdgpu_amdkfd_gpuvm_destroy_cb(struct
amdgpu_device *adev,
         if (!process_info)
           return;
-
       /* Release eviction fence from PD */
       amdgpu_bo_reserve(pd, false);
-    amdgpu_bo_fence(pd, NULL, false);
+    amdgpu_amdkfd_remove_eviction_fence(pd,
+                    process_info->eviction_fence);
Good catch as well, but Felix needs to take a look at this.
This is weird. We used amdgpu_bo_fence(pd, NULL, false) here, which would have removed an exclusive fence. But as far as I can tell we added the fence as a shared fence in init_kfd_vm and amdgpu_amdkfd_gpuvm_restore_process_bos. So this probably never worked as intended.

You could try if this is really needed. Just remove the eviction fence removal. Then enable eviction debugging with

      echo Y > /sys/module/amdgpu/parameters/debug_evictions

Run some simple tests and check the kernel log to see if process termination is causing any unexpected evictions.

Regards,
    Felix


Regards,
Christian.

       amdgpu_bo_unreserve(pd);
         /* Update process info */




[Index of Archives]     [Linux USB Devel]     [Linux Audio Users]     [Yosemite News]     [Linux Kernel]     [Linux SCSI]

  Powered by Linux