On 2022-04-19 at 12:01, Andrey Grodzovsky wrote:
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h
@@ -134,6 +134,7 @@ struct amdkfd_process_info {
 	/* MMU-notifier related fields */
 	atomic_t evicted_bos;
+	atomic_t invalid;
 	struct delayed_work restore_userptr_work;
 	struct pid *pid;
 };
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
index 99d2b15bcbf3..2a588eb9f456 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
@@ -1325,6 +1325,7 @@ static int init_kfd_vm(struct amdgpu_vm *vm, void **process_info,
 	info->pid = get_task_pid(current->group_leader, PIDTYPE_PID);
 	atomic_set(&info->evicted_bos, 0);
+	atomic_set(&info->invalid, 0);
 	INIT_DELAYED_WORK(&info->restore_userptr_work,
 			  amdgpu_amdkfd_restore_userptr_worker);
@@ -2693,6 +2694,9 @@ static void amdgpu_amdkfd_restore_userptr_worker(struct work_struct *work)
 	struct mm_struct *mm;
 	int evicted_bos;
+	if (atomic_read(&process_info->invalid))
+		return;
+
Probably better to again use a drm_dev_enter/exit guard pair instead of this flag.
I'm not sure I can use drm_dev_enter/exit efficiently, because a
process can have multiple drm_devices open. And I don't know how to
look up the right drm_device(s) efficiently in the worker function in
order to call drm_dev_enter/exit.
I think that within the KFD code each kfd device belongs to, or points
to, one specific drm_device, so I don't think this is a problem.
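For example, something along these lines could work. This is only a
rough sketch; kfd_process_info_to_drm_dev() is a made-up helper
standing in for however the worker would actually look up the device:

#include <drm/drm_drv.h>

static void amdgpu_amdkfd_restore_userptr_worker(struct work_struct *work)
{
	struct amdkfd_process_info *process_info =
		container_of(work, struct amdkfd_process_info,
			     restore_userptr_work.work);
	/* hypothetical helper: resolve the single drm_device behind
	 * this process's kfd device */
	struct drm_device *ddev = kfd_process_info_to_drm_dev(process_info);
	int idx;

	/* drm_dev_enter() fails once drm_dev_unplug() has run, so the
	 * worker bails out instead of touching the unplugged device */
	if (!drm_dev_enter(ddev, &idx))
		return;

	/* ... existing userptr restore logic ... */

	drm_dev_exit(idx);
}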
Sorry, I haven't been following this discussion in all its details, but
I don't see why you need to check a flag in the worker. If the GPU is
unplugged, you already cancel any pending work. How is new work getting
scheduled after the GPU is unplugged? Is it due to pending interrupts or
something? Can you instead invalidate process_info->restore_userptr_work
so it can't be scheduled again? Or add a check at the place where the
work is scheduled, instead of in the worker.
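Something like the following, as a rough sketch.
schedule_restore_userptr_work() is a hypothetical wrapper around the
existing schedule_delayed_work() call sites, reusing the invalid flag
from your patch:

static void schedule_restore_userptr_work(struct amdkfd_process_info *process_info)
{
	/* once the flag is set at unplug time, never re-arm the work;
	 * combined with cancelling any pending work on unplug, the
	 * worker can then never run against the dead device */
	if (atomic_read(&process_info->invalid))
		return;

	schedule_delayed_work(&process_info->restore_userptr_work,
			      msecs_to_jiffies(AMDGPU_USERPTR_RESTORE_DELAY_MS));
}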
Regards,
Felix