On 2024-09-27 06:36, Lang Yu wrote:
> dma_fence_get/put() should be called balanced in
> init_kfd_vm() and amdgpu_amdkfd_gpuvm_destroy_cb().

I don't think that's correct. The reference taken in init_kfd_vm is
returned to the caller of amdgpu_amdkfd_gpuvm_acquire_process_vm, which
stores it in the kfd_process structure. It's that caller's
responsibility to drop the reference. I think the real problem is that
we take a new reference for each VM, but there is only one kfd_process
structure per process. So the RCU_INIT_POINTER(p->ef, ef) in
kfd_process_device_init_vm overwrites the previous pointer and leaks
the reference it held.

Since we only need to get the eviction fence reference when creating the
first VM, I suggest this fix in kfd_process_device_init_vm:
 	ret = amdgpu_amdkfd_gpuvm_acquire_process_vm(dev->adev, avm,
 						     &p->kgd_process_info,
-						     &ef);
+						     p->ef ? NULL : &ef);

And in init_kfd_vm:
+	if (ef)
-	*ef = dma_fence_get(&vm->process_info->eviction_fence->base);
+		*ef = dma_fence_get(&vm->process_info->eviction_fence->base);

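To make the accounting concrete, here is a rough user-space sketch of the leak and of the suggested fix. It is not driver code: struct fence, struct process, init_vm, device_init_vm_leaky, device_init_vm_fixed and leaked_refs are hypothetical stand-ins, and only the references taken on behalf of kfd_process are counted. The "if (ef)" guard around the store is my assumption; something like it would be needed so that a later VM doesn't overwrite p->ef with NULL.

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-in for struct dma_fence: only the reference count is modeled. */
struct fence {
	int refcount;
};

static struct fence *fence_get(struct fence *f)
{
	f->refcount++;
	return f;
}

static void fence_put(struct fence *f)
{
	assert(f->refcount > 0);
	f->refcount--;
}

/* Toy stand-in for kfd_process: one eviction fence slot per process. */
struct process {
	struct fence eviction_fence;	/* lives in process_info in the driver */
	struct fence *ef;		/* the single reference slot */
	int n_vms;
};

/* Models init_kfd_vm(): takes a reference only if the caller asks for one. */
static void init_vm(struct process *p, struct fence **ef)
{
	p->n_vms++;
	if (ef)
		*ef = fence_get(&p->eviction_fence);
}

/* Current behaviour: every VM takes a reference and overwrites p->ef. */
void device_init_vm_leaky(struct process *p)
{
	struct fence *ef = NULL;

	init_vm(p, &ef);
	p->ef = ef;	/* the reference previously held in p->ef is lost */
}

/* Suggested behaviour: only the first VM takes a reference. */
void device_init_vm_fixed(struct process *p)
{
	struct fence *ef = NULL;

	init_vm(p, p->ef ? NULL : &ef);
	if (ef)		/* assumption: keep the existing p->ef otherwise */
		p->ef = ef;
}

/* Returns the references left over after the process drops its one p->ef. */
int leaked_refs(void (*init)(struct process *), int n_vms)
{
	struct process p = { .eviction_fence = { .refcount = 0 } };
	int i;

	for (i = 0; i < n_vms; i++)
		init(&p);
	if (p.ef)
		fence_put(p.ef);
	return p.eviction_fence.refcount;
}
```

With three VMs, leaked_refs(device_init_vm_leaky, 3) leaves two references behind, one for each VM after the first, while leaked_refs(device_init_vm_fixed, 3) leaves none.
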
Regards,
Felix
> Fixes: 9a1c1339abf9 ("drm/amdkfd: Run restore_workers on freezable WQs")
> Signed-off-by: Lang Yu <lang.yu@xxxxxxx>
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
> index ce5ca304dba9..c3a4f8d297f7 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
> @@ -1586,6 +1586,7 @@ void amdgpu_amdkfd_gpuvm_destroy_cb(struct amdgpu_device *adev,
>  	/* Update process info */
>  	mutex_lock(&process_info->lock);
> +	dma_fence_put(&process_info->eviction_fence->base);
>  	process_info->n_vms--;
>  	list_del(&vm->vm_list_node);
>  	mutex_unlock(&process_info->lock);
> @@ -1598,7 +1599,6 @@ void amdgpu_amdkfd_gpuvm_destroy_cb(struct amdgpu_device *adev,
>  	WARN_ON(!list_empty(&process_info->userptr_valid_list));
>  	WARN_ON(!list_empty(&process_info->userptr_inval_list));
> -	dma_fence_put(&process_info->eviction_fence->base);
>  	cancel_delayed_work_sync(&process_info->restore_userptr_work);
>  	put_pid(process_info->pid);
>  	mutex_destroy(&process_info->lock);