On 2022-07-05 15:02, philip yang wrote:
On 2022-07-05 14:48, Felix Kuehling wrote:
On 2022-06-10 21:03, Philip Yang wrote:
If destroying/unmapping queue failed, application may destroy queue
again, cause below kernel warning backtrace.
For outstanding queues, either applications forget to destroy or failed
to destroy, kfd_process_notifier_release will remove queue sysfs
objects, kfd_process_wq_release will free queue doorbell.
refcount_t: underflow; use-after-free.
WARNING: CPU: 7 PID: 3053 at lib/refcount.c:28
Call Trace:
kobject_put+0xd6/0x1a0
kfd_procfs_del_queue+0x27/0x30 [amdgpu]
pqm_destroy_queue+0xeb/0x240 [amdgpu]
kfd_ioctl_destroy_queue+0x32/0x70 [amdgpu]
kfd_ioctl+0x27d/0x500 [amdgpu]
do_syscall_64+0x35/0x80
WARNING: CPU: 2 PID: 3053 at
drivers/gpu/drm/amd/amdgpu/../amdkfd/kfd_device_queue_manager.c:400
Call Trace:
deallocate_doorbell.isra.0+0x39/0x40 [amdgpu]
destroy_queue_cpsch+0xb3/0x270 [amdgpu]
pqm_destroy_queue+0x108/0x240 [amdgpu]
kfd_ioctl_destroy_queue+0x32/0x70 [amdgpu]
kfd_ioctl+0x27d/0x500 [amdgpu]
Signed-off-by: Philip Yang <Philip.Yang@xxxxxxx>
---
drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c | 4 ++--
drivers/gpu/drm/amd/amdkfd/kfd_process_queue_manager.c | 2 +-
2 files changed, 3 insertions(+), 3 deletions(-)
diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c
b/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c
index e1797657b04c..1c519514ca1a 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c
@@ -1876,8 +1876,6 @@ static int destroy_queue_cpsch(struct
device_queue_manager *dqm,
mqd_mgr = dqm->mqd_mgrs[get_mqd_type_from_queue_type(
q->properties.type)];
- deallocate_doorbell(qpd, q);
-
if ((q->properties.type == KFD_QUEUE_TYPE_SDMA) ||
(q->properties.type == KFD_QUEUE_TYPE_SDMA_XGMI)) {
deallocate_sdma_queue(dqm, q);
@@ -1898,6 +1896,8 @@ static int destroy_queue_cpsch(struct
device_queue_manager *dqm,
}
}
+ deallocate_doorbell(qpd, q);
+
I'm not sure what this change is meant to do. I don't see an early
return in this function, so deallocate_doorbell will always be
executed either way.
If app destroy queue deallocate_doorbell, but then unmap queue failed,
app will destroy queue again when app exit, deallocate_doorbell second
time will trigger the WARN backtrace.
I get that. But even with your change, deallocate_doorbell will still be
executed if the queue unmap fails because there is no early return or
"goto error".
As queue_count and q->list is used to alloc ring buf in
execute_queues_cpsch, so this change causes regression on gfx9, I will
submit new patch to handle unmap failed case with MES and fix those
issues.
I think the intention of the code was to treat HWS in a way that does
not prevent queue destruction. Basically, there is not point reporting
HWS errors to the application because the application cannot do anything
about them anyway. Eventually it will cause a GPU reset and the
application will be killed. If you look at how -ETIME is handled in
pqm_destroy_queue, you see that it finishes the job regardless.
However, this has always been somewhat inconsistent. With MES maybe it's
getting worse because there may be other error conditions we didn't hit
before. Are you seeing those backtraces on a GPU with MES by any chance?
Regards,
Felix
Regards,
Philip
/*
* Unconditionally decrement this counter, regardless of the
queue's
* type
diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_process_queue_manager.c
b/drivers/gpu/drm/amd/amdkfd/kfd_process_queue_manager.c
index dc00484ff484..99f2a6412201 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_process_queue_manager.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_process_queue_manager.c
@@ -419,7 +419,6 @@ int pqm_destroy_queue(struct
process_queue_manager *pqm, unsigned int qid)
}
if (pqn->q) {
- kfd_procfs_del_queue(pqn->q);
dqm = pqn->q->device->dqm;
retval = dqm->ops.destroy_queue(dqm, &pdd->qpd, pqn->q);
if (retval) {
@@ -439,6 +438,7 @@ int pqm_destroy_queue(struct
process_queue_manager *pqm, unsigned int qid)
if (dev->shared_resources.enable_mes)
amdgpu_amdkfd_free_gtt_mem(dev->adev,
pqn->q->gang_ctx_bo);
+ kfd_procfs_del_queue(pqn->q);
This part makes sense.
Regards,
Felix
uninit_queue(pqn->q);
}