Re: [PATCH 1/1] drm/amdgpu: Fix double release KFD pasid

On 2022-12-13 10:57, Christian König wrote:
On 13.12.22 at 16:49, Philip Yang wrote:
If amdgpu_amdkfd_gpuvm_acquire_process_vm fails after the vm has been
converted to a KFD vm and vm->pasid has been set to the KFD pasid, KFD
does not take the pdd->drm_file reference. As a result, the drm close
file handler may be called to release the KFD pasid before KFD process
destroy releases the same pasid and sets vm->pasid to zero. This
generates the WARNING backtrace and NULL pointer access below.

Well NAK. If you fail after making the VM a compute VM the correct approach would be to drop this in the error handling again.

Since we don't need to reallocate anything that should also never fail.

I don't understand this comment.

The fundamental issue, as I understand it, is that compute VMs don't own their PASID. Multiple compute VMs on different GPUs share the same PASID. Therefore, freeing the PASID when the compute VM is destroyed is wrong. The PASID is freed by KFD when its process structure is destroyed.
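To illustrate the ownership rule described above, here is a minimal userspace sketch. It is not kernel code: the toy allocator stands in for the kernel IDA, and every name in it (toy_pasid_alloc, toy_pasid_free, toy_vm, toy_postclose, toy_process_destroy) is hypothetical. It assumes that PASID 0 means "nothing to free", which mirrors how the patch below uses pasid = 0 for compute contexts.

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_PASID 8 /* hypothetical pool size, for the sketch only */

/* Toy allocator standing in for the kernel IDA (assumption, not amdgpu code). */
static bool pasid_in_use[MAX_PASID];
static int double_free_count; /* counts frees of an id that is not allocated */

static unsigned int toy_pasid_alloc(void)
{
    /* id 0 is reserved to mean "no PASID" */
    for (unsigned int id = 1; id < MAX_PASID; id++) {
        if (!pasid_in_use[id]) {
            pasid_in_use[id] = true;
            return id;
        }
    }
    return 0;
}

static void toy_pasid_free(unsigned int id)
{
    assert(id < MAX_PASID);
    if (id == 0)
        return; /* nothing to free */
    if (!pasid_in_use[id]) {
        /* the real allocator would WARN about an unallocated id here */
        double_free_count++;
        return;
    }
    pasid_in_use[id] = false;
}

/* Reduced model of a VM: only the two fields that matter for ownership. */
struct toy_vm {
    unsigned int pasid;
    bool is_compute_context;
};

/* File close path: free the PASID only if this VM owns it. */
static void toy_postclose(struct toy_vm *vm)
{
    unsigned int pasid = vm->is_compute_context ? 0 : vm->pasid;

    toy_pasid_free(pasid);
}

/* Process teardown: the shared PASID is released exactly once, here. */
static void toy_process_destroy(unsigned int process_pasid)
{
    toy_pasid_free(process_pasid);
}
```

In this model, two compute VMs on different GPUs can carry the same PASID and both be closed without touching the allocator; only toy_process_destroy releases the id. Without the is_compute_context check, the later process teardown would free an id that the close path had already released.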

Regards,
  Felix



Christian.


For compute processes, KFD manages the pasid, so the drm close file
handler should not release the KFD pasid; this avoids the double release.

  amdgpu: Failed to create process VM object

  ida_free called for id=32770 which is not allocated.
  WARNING: CPU: 57 PID: 72542 at ../lib/idr.c:522 ida_free+0x96/0x140
  RIP: 0010:ida_free+0x96/0x140
  Call Trace:
   amdgpu_pasid_free_delayed+0xe1/0x2a0 [amdgpu]
   amdgpu_driver_postclose_kms+0x2d8/0x340 [amdgpu]
   drm_file_free.part.13+0x216/0x270 [drm]
   drm_close_helper.isra.14+0x60/0x70 [drm]
   drm_release+0x6e/0xf0 [drm]
   __fput+0xcc/0x280
   ____fput+0xe/0x20
   task_work_run+0x96/0xc0
   do_exit+0x3d0/0xc10

  BUG: kernel NULL pointer dereference, address: 0000000000000000
  RIP: 0010:ida_free+0x76/0x140
  Call Trace:
   amdgpu_pasid_free_delayed+0xe1/0x2a0 [amdgpu]
   amdgpu_driver_postclose_kms+0x2d8/0x340 [amdgpu]
   drm_file_free.part.13+0x216/0x270 [drm]
   drm_close_helper.isra.14+0x60/0x70 [drm]
   drm_release+0x6e/0xf0 [drm]
   __fput+0xcc/0x280
   ____fput+0xe/0x20
   task_work_run+0x96/0xc0
   do_exit+0x3d0/0xc10

Suggested-by: Felix Kuehling <Felix.Kuehling@xxxxxxx>
Signed-off-by: Philip Yang <Philip.Yang@xxxxxxx>
---
  drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c | 8 +++++++-
  1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c
index efc0a13e9aea..bf444c3f6656 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c
@@ -1244,8 +1244,14 @@ void amdgpu_driver_postclose_kms(struct drm_device *dev,
          amdgpu_bo_unreserve(adev->virt.csa_obj);
      }

-    pasid = fpriv->vm.pasid;
+    if (fpriv->vm.is_compute_context)
+        /* pasid managed by KFD is released when process is destroyed */
+        pasid = 0;
+    else
+        pasid = fpriv->vm.pasid;
+
      pd = amdgpu_bo_ref(fpriv->vm.root.bo);
+
      if (!WARN_ON(amdgpu_bo_reserve(pd, true))) {
          amdgpu_vm_bo_del(adev, fpriv->prt_va);
          amdgpu_bo_unreserve(pd);



