On 2019-10-18 1:36 p.m., Yang, Philip wrote:
> If the device is locked for suspend and resume, kfd open should return
> -EAGAIN without creating a process; otherwise the application's exit,
> which releases the process, will hang waiting for resume to finish if
> suspend/resume is stuck somewhere. Backtrace:
>
> v2: fix processes that were created before suspend/resume got stuck
>
> [Thu Oct 17 16:43:37 2019] INFO: task rocminfo:3024 blocked for more than 120 seconds.
> [Thu Oct 17 16:43:37 2019] Not tainted 5.0.0-rc1-kfd-compute-rocm-dkms-no-npi-1131 #1
> [Thu Oct 17 16:43:37 2019] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> [Thu Oct 17 16:43:37 2019] rocminfo        D    0  3024   2947 0x80000000
> [Thu Oct 17 16:43:37 2019] Call Trace:
> [Thu Oct 17 16:43:37 2019]  ? __schedule+0x3d9/0x8a0
> [Thu Oct 17 16:43:37 2019]  schedule+0x32/0x70
> [Thu Oct 17 16:43:37 2019]  schedule_preempt_disabled+0xa/0x10
> [Thu Oct 17 16:43:37 2019]  __mutex_lock.isra.9+0x1e3/0x4e0
> [Thu Oct 17 16:43:37 2019]  ? __call_srcu+0x264/0x3b0
> [Thu Oct 17 16:43:37 2019]  ? process_termination_cpsch+0x24/0x2f0 [amdgpu]
> [Thu Oct 17 16:43:37 2019]  process_termination_cpsch+0x24/0x2f0 [amdgpu]
> [Thu Oct 17 16:43:37 2019]  kfd_process_dequeue_from_all_devices+0x42/0x60 [amdgpu]
> [Thu Oct 17 16:43:37 2019]  kfd_process_notifier_release+0x1be/0x220 [amdgpu]
> [Thu Oct 17 16:43:37 2019]  __mmu_notifier_release+0x3e/0xc0
> [Thu Oct 17 16:43:37 2019]  exit_mmap+0x160/0x1a0
> [Thu Oct 17 16:43:37 2019]  ? __handle_mm_fault+0xba3/0x1200
> [Thu Oct 17 16:43:37 2019]  ? exit_robust_list+0x5a/0x110
> [Thu Oct 17 16:43:37 2019]  mmput+0x4a/0x120
> [Thu Oct 17 16:43:37 2019]  do_exit+0x284/0xb20
> [Thu Oct 17 16:43:37 2019]  ? handle_mm_fault+0xfa/0x200
> [Thu Oct 17 16:43:37 2019]  do_group_exit+0x3a/0xa0
> [Thu Oct 17 16:43:37 2019]  __x64_sys_exit_group+0x14/0x20
> [Thu Oct 17 16:43:37 2019]  do_syscall_64+0x4f/0x100
> [Thu Oct 17 16:43:37 2019]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
>
> Signed-off-by: Philip Yang <Philip.Yang@xxxxxxx>
> ---
>  drivers/gpu/drm/amd/amdkfd/kfd_chardev.c               | 6 +++---
>  drivers/gpu/drm/amd/amdkfd/kfd_process_queue_manager.c | 6 ++++++
>  2 files changed, 9 insertions(+), 3 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_chardev.c b/drivers/gpu/drm/amd/amdkfd/kfd_chardev.c
> index d9e36dbf13d5..40d75c39f08e 100644
> --- a/drivers/gpu/drm/amd/amdkfd/kfd_chardev.c
> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_chardev.c
> @@ -120,13 +120,13 @@ static int kfd_open(struct inode *inode, struct file *filep)
>  		return -EPERM;
>  	}
>
> +	if (kfd_is_locked())
> +		return -EAGAIN;
> +
>  	process = kfd_create_process(filep);
>  	if (IS_ERR(process))
>  		return PTR_ERR(process);
>
> -	if (kfd_is_locked())
> -		return -EAGAIN;
> -
>  	dev_dbg(kfd_device, "process %d opened, compat mode (32 bit) - %d\n",
>  		process->pasid, process->is_32bit_user_mode);
>
> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_process_queue_manager.c b/drivers/gpu/drm/amd/amdkfd/kfd_process_queue_manager.c
> index 8509814a6ff0..3784013b92a0 100644
> --- a/drivers/gpu/drm/amd/amdkfd/kfd_process_queue_manager.c
> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_process_queue_manager.c
> @@ -128,6 +128,12 @@ void kfd_process_dequeue_from_all_devices(struct kfd_process *p)
>  {
>  	struct kfd_process_device *pdd;
>
> +	/* If suspend/resume got stuck, dqm_lock is hold,
> +	 * skip process_termination_cpsch to avoid deadlock
> +	 */
> +	if (kfd_is_locked())
> +		return;
> +

Holding the DQM lock during reset has caused other problems (lock
dependency issues and deadlocks), and I was thinking about getting rid
of that completely.

The intention of holding the DQM lock during reset was to prevent the
device queue manager from accessing the CP hardware while a reset was
in progress. However, I think there are smarter ways to achieve that.
We already get a pre-reset callback (kgd2kfd_pre_reset) that calls
kgd2kfd_suspend, which suspends processes and stops DQM through
kfd->dqm->ops.stop(kfd->dqm). This should take care of most of the
problem. If there are any places in DQM that try to access the devices,
they should add conditions to not access the HW while DQM is stopped.
Then we could avoid holding a lock indefinitely while a reset is in
progress.

The DQM lock is particularly problematic in terms of lock dependencies
because it can be taken in MMU notifiers. We want to avoid taking any
other locks while holding the DQM lock. Holding the DQM lock for a long
time during reset is counterproductive to that objective.
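(For illustration only, not part of the patch or the reply above: a
minimal sketch of the kind of "don't touch HW while DQM is stopped"
condition being described. The names here -- the sched_stopped flag,
dqm_stop() and dqm_update_runlist() -- are hypothetical, not the actual
amdkfd structures or entry points.)

/* Illustration only -- hypothetical names, not the real amdkfd code. */
#include <linux/mutex.h>

struct device_queue_manager {
	struct mutex lock;
	bool sched_stopped;	/* set by stop(), cleared by start() */
};

/* Called from the pre-reset/suspend path: mark scheduling as stopped. */
static void dqm_stop(struct device_queue_manager *dqm)
{
	mutex_lock(&dqm->lock);
	dqm->sched_stopped = true;
	mutex_unlock(&dqm->lock);
}

/* Any path that would program the CP checks the flag under the lock,
 * so the lock is only held briefly and never across the whole reset.
 */
static int dqm_update_runlist(struct device_queue_manager *dqm)
{
	int ret = 0;

	mutex_lock(&dqm->lock);
	if (dqm->sched_stopped)
		goto out;	/* HW is being reset; skip touching it */

	/* ... write the runlist / ring the doorbell here ... */
out:
	mutex_unlock(&dqm->lock);
	return ret;
}

The point of the sketch is that the stopped check replaces the
long-held lock: hardware access is simply skipped while the device is
suspended or in reset, so no lock dependency spans the reset.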
Regards,
  Felix

>  	list_for_each_entry(pdd, &p->per_device_data, per_device_list)
>  		kfd_process_dequeue_from_device(pdd);
>  }