Handle queue destroy failure while CP hang. Once CP got hang, kfd trigger GPU reset and set related flags to stop driver touching the queue. As we leave the queue as it is, we need keep the resource as it is too. Regardless user-space tries to destroy the queue again or not. We need put queue back to the list so process termination would do the cleanup work. What's more, if userspace tries to destroy the queue again, we would not free its resource twice. Kfd return -EIO in this case, so lets handle it now. Paste some error log below without this patch. amdgpu: Can't create new usermode queue because -1 queues were already created refcount_t: underflow; use-after-free. Call Trace: kobject_put+0xe6/0x1b0 kfd_procfs_del_queue+0x37/0x50 [amdgpu] pqm_destroy_queue+0x17a/0x390 [amdgpu] kfd_ioctl_destroy_queue+0x57/0xc0 [amdgpu] kfd_ioctl+0x463/0x690 [amdgpu] BUG kmalloc-32 (Tainted: G W ): Object already free INFO: Allocated in allocate_sdma_mqd+0x30/0xb0 [amdgpu] age=4796 cpu=2 pid=2511 __slab_alloc+0x72/0x80 kmem_cache_alloc_trace+0x81f/0x8c0 allocate_sdma_mqd+0x30/0xb0 [amdgpu] create_queue_cpsch+0xbf/0x470 [amdgpu] pqm_create_queue+0x28d/0x6d0 [amdgpu] kfd_ioctl_create_queue+0x492/0xae0 [amdgpu] INFO: Freed in free_mqd_hiq_sdma+0x20/0x60 [amdgpu] age=2537 cpu=7 pid=2511 kfree+0x322/0x340 free_mqd_hiq_sdma+0x20/0x60 [amdgpu] destroy_queue_cpsch+0x20c/0x330 [amdgpu] pqm_destroy_queue+0x1a3/0x390 [amdgpu] kfd_ioctl_destroy_queue+0x57/0xc0 [amdgpu] Signed-off-by: xinhui pan <xinhui.pan@xxxxxxx> --- .../gpu/drm/amd/amdkfd/kfd_device_queue_manager.c | 13 +++++++++++++ drivers/gpu/drm/amd/amdkfd/kfd_process.c | 4 +++- .../gpu/drm/amd/amdkfd/kfd_process_queue_manager.c | 2 ++ 3 files changed, 18 insertions(+), 1 deletion(-) diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c b/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c index c069fa259b30..63a9a19a3987 100644 --- a/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c +++ b/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c @@ -1530,6 +1530,11 @@ static int destroy_queue_cpsch(struct device_queue_manager *dqm, KFD_UNMAP_QUEUES_FILTER_DYNAMIC_QUEUES, 0); if (retval == -ETIME) qpd->reset_wavefronts = true; + /* In gpu reset? We leave the queue as it is, so do NOT + * cleanup the resource. + */ + else if (retval == -EIO) + goto failed_execute_queue; if (q->properties.is_gws) { dqm->gws_queue_count--; qpd->mapped_gws_queue = false; @@ -1551,6 +1556,14 @@ static int destroy_queue_cpsch(struct device_queue_manager *dqm, return retval; +failed_execute_queue: + /* Put queue back to the list, then we have chance to destroy it. + * FIXME: we do NOT want the queue in the runlist again. + */ + list_add(&q->list, &qpd->queues_list); + qpd->queue_count++; + if (q->properties.is_active) + increment_queue_count(dqm, q->properties.type); failed_try_destroy_debugged_queue: dqm_unlock(dqm); diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_process.c b/drivers/gpu/drm/amd/amdkfd/kfd_process.c index 09b98a83f670..984197e5929f 100644 --- a/drivers/gpu/drm/amd/amdkfd/kfd_process.c +++ b/drivers/gpu/drm/amd/amdkfd/kfd_process.c @@ -607,11 +607,13 @@ static int kfd_procfs_add_sysfs_files(struct kfd_process *p) void kfd_procfs_del_queue(struct queue *q) { - if (!q) + if (!q || !kobject_get_unless_zero(&q->kobj)) return; kobject_del(&q->kobj); kobject_put(&q->kobj); + /* paired with the get above */ + kobject_put(&q->kobj); } int kfd_process_create_wq(void) diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_process_queue_manager.c b/drivers/gpu/drm/amd/amdkfd/kfd_process_queue_manager.c index 95a6c36cea4c..0588e552b8ec 100644 --- a/drivers/gpu/drm/amd/amdkfd/kfd_process_queue_manager.c +++ b/drivers/gpu/drm/amd/amdkfd/kfd_process_queue_manager.c @@ -373,6 +373,7 @@ int pqm_destroy_queue(struct process_queue_manager *pqm, unsigned int qid) dqm = pqn->kq->dev->dqm; dqm->ops.destroy_kernel_queue(dqm, pqn->kq, &pdd->qpd); kernel_queue_uninit(pqn->kq, false); + pqn->kq = NULL; } if (pqn->q) { @@ -396,6 +397,7 @@ int pqm_destroy_queue(struct process_queue_manager *pqm, unsigned int qid) kfree(pqn->q->properties.cu_mask); pqn->q->properties.cu_mask = NULL; uninit_queue(pqn->q); + pqn->q = NULL; } list_del(&pqn->process_queue_list); -- 2.25.1 _______________________________________________ amd-gfx mailing list amd-gfx@xxxxxxxxxxxxxxxxxxxxx https://lists.freedesktop.org/mailman/listinfo/amd-gfx