Re: [PATCH] drm/amdkfd: Fixed kfd_process cleanup on module exit.

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 




Am 2023-03-06 um 16:58 schrieb David Belanger:
Handle case when module is unloaded (kfd_exit) before a process space
(mm_struct) is released.

Signed-off-by: David Belanger <david.belanger@xxxxxxx>
---
  drivers/gpu/drm/amd/amdkfd/kfd_module.c  |  4 ++
  drivers/gpu/drm/amd/amdkfd/kfd_process.c | 57 ++++++++++++++++++++++++
  2 files changed, 61 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_module.c b/drivers/gpu/drm/amd/amdkfd/kfd_module.c
index 09b966dc3768..8ef4bd9e4f7d 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_module.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_module.c
@@ -26,6 +26,9 @@
  #include "kfd_priv.h"
  #include "amdgpu_amdkfd.h"
+void kfd_cleanup_processes(void);
+
+
  static int kfd_init(void)
  {
  	int err;
@@ -77,6 +80,7 @@ static int kfd_init(void)
static void kfd_exit(void)
  {
+	kfd_cleanup_processes();
  	kfd_debugfs_fini();
  	kfd_process_destroy_wq();
  	kfd_procfs_shutdown();
diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_process.c b/drivers/gpu/drm/amd/amdkfd/kfd_process.c
index ebabe92f7edb..b5b28a32639d 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_process.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_process.c
@@ -1181,6 +1181,17 @@ static void kfd_process_notifier_release(struct mmu_notifier *mn,
  		return;
mutex_lock(&kfd_processes_mutex);
+	/*
+	 * Do early return if p is not in the table.
+	 *
+	 * This could potentially happen if this function is called concurrently
+	 * by mmu_notifier and by kfd_cleanup_pocesses.
+	 *
+	 */
+	if (!hash_hashed(&p->kfd_processes)) {
+		mutex_unlock(&kfd_processes_mutex);
+		return;
+	}
  	hash_del_rcu(&p->kfd_processes);
  	mutex_unlock(&kfd_processes_mutex);
  	synchronize_srcu(&kfd_processes_srcu);
@@ -1200,6 +1211,52 @@ static const struct mmu_notifier_ops kfd_process_mmu_notifier_ops = {
  	.free_notifier = kfd_process_free_notifier,
  };
+
+void kfd_cleanup_processes(void)
+{
+	struct kfd_process *p;
+	unsigned int temp;
+
+	/*
+	 * Iterate over remaining processes in table, calling notifier release
+	 * to free up notifier and process resources.
+	 *
+	 * This code handles the case when driver is unloaded before all mm_struct
+	 * are released.
+	 */
+	int idx = srcu_read_lock(&kfd_processes_srcu);
+
+	hash_for_each_rcu(kfd_processes_table, temp, p, kfd_processes) {
+		if (p) {
+			/*
+			 * Obtain a reference on p to avoid a late mmu_notifier release
+			 * call triggering freeing the process.
+			 */
+
+			kref_get(&p->ref);
+
+			srcu_read_unlock(&kfd_processes_srcu, idx);

I don't think it's valid to drop the lock in the middle of the loop. You need to hold the lock throughout the loop to protect the consistency of the data structure. I guess you're doing this because you got a deadlock from synchronize_srcu in kfd_process_notifier_release.


+
+			kfd_process_notifier_release(&p->mmu_notifier, p->mm);

This calls hash_del_rcu to remove the process from the hash table. To make this safe, you need to hold the kfd_processes_mutex.

Since this is outside the RCU read lock, the entry in the hlist can be freed, which can cause problems when the hash_for_each_rcu loop tries to find the next entry in the hlist.


+
+			kfd_unref_process(p);

This schedules a worker that can free the process at any time, which also frees the hlist_node p->kfd_processes that is still needed by hash_for_each_rcu to find the next entry. If you're unlucky, the worker will be scheduled before the next loop iteration, and you can get a kernel oops.

I suggest a safer strategy: Make a loop using hash_for_each_safe to move all the processes into a new hlist. You can do that while holding the kfd_processes_mutex, so you can safely remove all entries from the hash table and move them into your own private hlist.

Then you can safely release all the processes from your private hlist without having to hold either the srcu read lock or the mutex.

Regards,
  Felix


+
+			idx = srcu_read_lock(&kfd_processes_srcu);
+		}
+	}
+	srcu_read_unlock(&kfd_processes_srcu, idx);
+
+	/*
+	 * Must be called after all mmu_notifier_put are done and before
+	 * kfd_process_wq is released.
+	 *
+	 * Ensures that all outstanding free_notifier gets called, triggering the release
+	 * of the process.
+	 */
+	mmu_notifier_synchronize();
+}
+
+
  static int kfd_process_init_cwsr_apu(struct kfd_process *p, struct file *filep)
  {
  	unsigned long  offset;



[Index of Archives]     [Linux USB Devel]     [Linux Audio Users]     [Yosemite News]     [Linux Kernel]     [Linux SCSI]

  Powered by Linux