On 2022-04-19 at 12:01, Andrey Grodzovsky wrote:
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h
@@ -134,6 +134,7 @@ struct amdkfd_process_info {
 	/* MMU-notifier related fields */
 	atomic_t evicted_bos;
+	atomic_t invalid;
 	struct delayed_work restore_userptr_work;
 	struct pid *pid;
 };
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
index 99d2b15bcbf3..2a588eb9f456 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
@@ -1325,6 +1325,7 @@ static int init_kfd_vm(struct amdgpu_vm *vm, void **process_info,
 	info->pid = get_task_pid(current->group_leader, PIDTYPE_PID);
 	atomic_set(&info->evicted_bos, 0);
+	atomic_set(&info->invalid, 0);
 	INIT_DELAYED_WORK(&info->restore_userptr_work,
 			  amdgpu_amdkfd_restore_userptr_worker);
@@ -2693,6 +2694,9 @@ static void amdgpu_amdkfd_restore_userptr_worker(struct work_struct *work)
 	struct mm_struct *mm;
 	int evicted_bos;
+	if (atomic_read(&process_info->invalid))
+		return;
+
Probably better to again use a drm_dev_enter/exit guard pair instead of this flag.
I'm not sure I can use drm_dev_enter/exit efficiently, because a
process can have multiple drm_devices open. And I don't know how to
look up the right drm_device(s) efficiently in the worker function in
order to call drm_dev_enter/exit.
I think that within the KFD code each kfd device belongs to, or points
to, one specific drm_device, so I don't think this is a problem.
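For example, something along these lines could work. This is only a
rough sketch; kfd_process_info_to_drm_dev() is a made-up helper
standing in for however the worker would actually look up the device:

#include <drm/drm_drv.h>

static void amdgpu_amdkfd_restore_userptr_worker(struct work_struct *work)
{
	struct amdkfd_process_info *process_info =
		container_of(work, struct amdkfd_process_info,
			     restore_userptr_work.work);
	/* hypothetical helper: resolve the single drm_device behind
	 * this process's kfd device */
	struct drm_device *ddev = kfd_process_info_to_drm_dev(process_info);
	int idx;

	/* drm_dev_enter() fails once drm_dev_unplug() has run, so the
	 * worker bails out instead of touching the unplugged device */
	if (!drm_dev_enter(ddev, &idx))
		return;

	/* ... existing userptr restore logic ... */

	drm_dev_exit(idx);
}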
Sorry, I haven't been following this discussion in all its details, but
I don't see why you need to check a flag in the worker. If the GPU is
unplugged, you already cancel any pending work. How is new work getting
scheduled after the GPU is unplugged? Is it due to pending interrupts or
something? Can you instead invalidate process_info->restore_userptr_work
so it can't be scheduled again? Or add a check at the place where the
work is scheduled, instead of in the worker.
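Something like the following, as a rough sketch.
schedule_restore_userptr_work() is a hypothetical wrapper around the
existing schedule_delayed_work() call sites, reusing the invalid flag
from your patch:

static void schedule_restore_userptr_work(struct amdkfd_process_info *process_info)
{
	/* once the flag is set at unplug time, never re-arm the work;
	 * combined with cancelling any pending work on unplug, the
	 * worker can then never run against the dead device */
	if (atomic_read(&process_info->invalid))
		return;

	schedule_delayed_work(&process_info->restore_userptr_work,
			      msecs_to_jiffies(AMDGPU_USERPTR_RESTORE_DELAY_MS));
}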
Regards,
Felix