[AMD Official Use Only - General] > -----Original Message----- > From: Christian König <ckoenig.leichtzumerken@xxxxxxxxx> > Sent: Wednesday, March 8, 2023 4:08 AM > To: Belanger, David <David.Belanger@xxxxxxx>; Koenig, Christian > <Christian.Koenig@xxxxxxx>; amd-gfx@xxxxxxxxxxxxxxxxxxxxx > Subject: Re: [PATCH] drm/amdkfd: Fixed kfd_process cleanup on module > exit. > > Caution: This message originated from an External Source. Use proper > caution when opening attachments, clicking links, or responding. > > > Am 07.03.23 um 16:28 schrieb Belanger, David: > > [AMD Official Use Only - General] > > > > > > The test case is a python program that will load the driver, do some > operations, then unload the driver. > > What do you mean with unloading the driver? Removing the module? Or > destroying the device? > The python program calls a shell script that does "modprobe amdgpu". Calls some SMI operation to get some events. Then it calls a shell scripts that does "modprobe -r amdgpu". Then it exits. There will be ref on the kfd_process that will remain, which will be released only when mmu_notifier ops->release is called. This does not get called until the python process ends. The test program is definitively not the typical use a general user would do. > > When the driver exists, there is still the python process space around > holding on the address space. > > When the python process space exits, the mmu_notifier gets called but the > driver has already been unloaded. > > > > The goal of the fix is to address case where there could be > > outstanding address space / worker threads for process cleanup that needs > to be cleared/completed at exit time. > > Yeah and when the module is unloaded this is a completely futile effort. > > The general upstream approach is to take references on the struct device and > module and prevent unloading as long as those references exists. > > The device might be non-functional any more (because for example of hot > plug), but the driver should never be unloaded before the python program > exits. Thank you for your feedback, I will investigate that approach. > > Regards, > Christian. > > > > > Regards, > > David B. > > > >> -----Original Message----- > >> From: Koenig, Christian <Christian.Koenig@xxxxxxx> > >> Sent: Tuesday, March 7, 2023 2:05 AM > >> To: Belanger, David <David.Belanger@xxxxxxx>; amd- > >> gfx@xxxxxxxxxxxxxxxxxxxxx > >> Subject: Re: [PATCH] drm/amdkfd: Fixed kfd_process cleanup on module > >> exit. > >> > >> Am 06.03.23 um 22:58 schrieb David Belanger: > >>> Handle case when module is unloaded (kfd_exit) before a process > >>> space > >>> (mm_struct) is released. > >> Well that should never ever happen in the first place. It sounds like > >> we are missing grabbing module references. > >> > >> Regards, > >> Christian. > >> > >>> Signed-off-by: David Belanger <david.belanger@xxxxxxx> > >>> --- > >>> drivers/gpu/drm/amd/amdkfd/kfd_module.c | 4 ++ > >>> drivers/gpu/drm/amd/amdkfd/kfd_process.c | 57 > >> ++++++++++++++++++++++++ > >>> 2 files changed, 61 insertions(+) > >>> > >>> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_module.c > >>> b/drivers/gpu/drm/amd/amdkfd/kfd_module.c > >>> index 09b966dc3768..8ef4bd9e4f7d 100644 > >>> --- a/drivers/gpu/drm/amd/amdkfd/kfd_module.c > >>> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_module.c > >>> @@ -26,6 +26,9 @@ > >>> #include "kfd_priv.h" > >>> #include "amdgpu_amdkfd.h" > >>> > >>> +void kfd_cleanup_processes(void); > >>> + > >>> + > >>> static int kfd_init(void) > >>> { > >>> int err; > >>> @@ -77,6 +80,7 @@ static int kfd_init(void) > >>> > >>> static void kfd_exit(void) > >>> { > >>> + kfd_cleanup_processes(); > >>> kfd_debugfs_fini(); > >>> kfd_process_destroy_wq(); > >>> kfd_procfs_shutdown(); > >>> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_process.c > >>> b/drivers/gpu/drm/amd/amdkfd/kfd_process.c > >>> index ebabe92f7edb..b5b28a32639d 100644 > >>> --- a/drivers/gpu/drm/amd/amdkfd/kfd_process.c > >>> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_process.c > >>> @@ -1181,6 +1181,17 @@ static void > >>> kfd_process_notifier_release(struct > >> mmu_notifier *mn, > >>> return; > >>> > >>> mutex_lock(&kfd_processes_mutex); > >>> + /* > >>> + * Do early return if p is not in the table. > >>> + * > >>> + * This could potentially happen if this function is called concurrently > >>> + * by mmu_notifier and by kfd_cleanup_pocesses. > >>> + * > >>> + */ > >>> + if (!hash_hashed(&p->kfd_processes)) { > >>> + mutex_unlock(&kfd_processes_mutex); > >>> + return; > >>> + } > >>> hash_del_rcu(&p->kfd_processes); > >>> mutex_unlock(&kfd_processes_mutex); > >>> synchronize_srcu(&kfd_processes_srcu); > >>> @@ -1200,6 +1211,52 @@ static const struct mmu_notifier_ops > >> kfd_process_mmu_notifier_ops = { > >>> .free_notifier = kfd_process_free_notifier, > >>> }; > >>> > >>> + > >>> +void kfd_cleanup_processes(void) > >>> +{ > >>> + struct kfd_process *p; > >>> + unsigned int temp; > >>> + > >>> + /* > >>> + * Iterate over remaining processes in table, calling notifier release > >>> + * to free up notifier and process resources. > >>> + * > >>> + * This code handles the case when driver is unloaded before all > >> mm_struct > >>> + * are released. > >>> + */ > >>> + int idx = srcu_read_lock(&kfd_processes_srcu); > >>> + > >>> + hash_for_each_rcu(kfd_processes_table, temp, p, kfd_processes) { > >>> + if (p) { > >>> + /* > >>> + * Obtain a reference on p to avoid a late > >> mmu_notifier release > >>> + * call triggering freeing the process. > >>> + */ > >>> + > >>> + kref_get(&p->ref); > >>> + > >>> + srcu_read_unlock(&kfd_processes_srcu, idx); > >>> + > >>> + kfd_process_notifier_release(&p->mmu_notifier, > >>> + p- > >>> mm); > >>> + > >>> + kfd_unref_process(p); > >>> + > >>> + idx = srcu_read_lock(&kfd_processes_srcu); > >>> + } > >>> + } > >>> + srcu_read_unlock(&kfd_processes_srcu, idx); > >>> + > >>> + /* > >>> + * Must be called after all mmu_notifier_put are done and before > >>> + * kfd_process_wq is released. > >>> + * > >>> + * Ensures that all outstanding free_notifier gets called, > >>> + triggering the > >> release > >>> + * of the process. > >>> + */ > >>> + mmu_notifier_synchronize(); > >>> +} > >>> + > >>> + > >>> static int kfd_process_init_cwsr_apu(struct kfd_process *p, > >>> struct file > >> *filep) > >>> { > >>> unsigned long offset;