Re: [PATCH 1/1] drm/amdkfd: Remove queue sysfs and doorbell after unmapping

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 




On 2022-07-05 15:12, Felix Kuehling wrote:
On 2022-07-05 15:02, philip yang wrote:


On 2022-07-05 14:48, Felix Kuehling wrote:
On 2022-06-10 21:03, Philip Yang wrote:
If destroying/unmapping queue failed, application may destroy queue
again, cause below kernel warning backtrace.

For outstanding queues, either applications forget to destroy or failed
to destroy, kfd_process_notifier_release will remove queue sysfs
objects, kfd_process_wq_release will free queue doorbell.

  refcount_t: underflow; use-after-free.
  WARNING: CPU: 7 PID: 3053 at lib/refcount.c:28
   Call Trace:
    kobject_put+0xd6/0x1a0
    kfd_procfs_del_queue+0x27/0x30 [amdgpu]
    pqm_destroy_queue+0xeb/0x240 [amdgpu]
    kfd_ioctl_destroy_queue+0x32/0x70 [amdgpu]
    kfd_ioctl+0x27d/0x500 [amdgpu]
    do_syscall_64+0x35/0x80

  WARNING: CPU: 2 PID: 3053 at drivers/gpu/drm/amd/amdgpu/../amdkfd/kfd_device_queue_manager.c:400
   Call Trace:
    deallocate_doorbell.isra.0+0x39/0x40 [amdgpu]
    destroy_queue_cpsch+0xb3/0x270 [amdgpu]
    pqm_destroy_queue+0x108/0x240 [amdgpu]
    kfd_ioctl_destroy_queue+0x32/0x70 [amdgpu]
    kfd_ioctl+0x27d/0x500 [amdgpu]

Signed-off-by: Philip Yang <Philip.Yang@xxxxxxx>
---
  drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c  | 4 ++--
  drivers/gpu/drm/amd/amdkfd/kfd_process_queue_manager.c | 2 +-
  2 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c b/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c
index e1797657b04c..1c519514ca1a 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c
@@ -1876,8 +1876,6 @@ static int destroy_queue_cpsch(struct device_queue_manager *dqm,
      mqd_mgr = dqm->mqd_mgrs[get_mqd_type_from_queue_type(
              q->properties.type)];
  -    deallocate_doorbell(qpd, q);
-
      if ((q->properties.type == KFD_QUEUE_TYPE_SDMA) ||
          (q->properties.type == KFD_QUEUE_TYPE_SDMA_XGMI)) {
          deallocate_sdma_queue(dqm, q);
@@ -1898,6 +1896,8 @@ static int destroy_queue_cpsch(struct device_queue_manager *dqm,
          }
      }
  +    deallocate_doorbell(qpd, q);
+

I'm not sure what this change is meant to do. I don't see an early return in this function, so deallocate_doorbell will always be executed either way.

If app destroy queue deallocate_doorbell, but then unmap queue failed, app will destroy queue again when app exit, deallocate_doorbell second time will trigger the WARN backtrace.

I get that. But even with your change, deallocate_doorbell will still be executed if the queue unmap fails because there is no early return or "goto error".
Yes, that was fixed in patch v2. [PATCH v2 2/2] drm/amdkfd: Free queue after unmap queue success


As queue_count and q->list is used to alloc ring buf in execute_queues_cpsch, so this change causes regression on gfx9, I will submit new patch to handle unmap failed case with MES and fix those issues.

I think the intention of the code was to treat HWS in a way that does not prevent queue destruction. Basically, there is not point reporting HWS errors to the application because the application cannot do anything about them anyway. Eventually it will cause a GPU reset and the application will be killed. If you look at how -ETIME is handled in pqm_destroy_queue, you see that it finishes the job regardless.

However, this has always been somewhat inconsistent. With MES maybe it's getting worse because there may be other error conditions we didn't hit before. Are you seeing those backtraces on a GPU with MES by any chance?

Thanks for the info, MES returns other error code than -ETIME, so pqm_destroy_queue don't finish the destroy procedure and return error code to user space. Seems we just need change MES error handle the same way as old HWS, to finish the job regardless.

Thanks,

Philip

Regards,
  Felix


Regards,

Philip


      /*
       * Unconditionally decrement this counter, regardless of the queue's
       * type
diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_process_queue_manager.c b/drivers/gpu/drm/amd/amdkfd/kfd_process_queue_manager.c
index dc00484ff484..99f2a6412201 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_process_queue_manager.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_process_queue_manager.c
@@ -419,7 +419,6 @@ int pqm_destroy_queue(struct process_queue_manager *pqm, unsigned int qid)
      }
        if (pqn->q) {
-        kfd_procfs_del_queue(pqn->q);
          dqm = pqn->q->device->dqm;
          retval = dqm->ops.destroy_queue(dqm, &pdd->qpd, pqn->q);
          if (retval) {
@@ -439,6 +438,7 @@ int pqm_destroy_queue(struct process_queue_manager *pqm, unsigned int qid)
          if (dev->shared_resources.enable_mes)
              amdgpu_amdkfd_free_gtt_mem(dev->adev,
                             pqn->q->gang_ctx_bo);
+        kfd_procfs_del_queue(pqn->q);

This part makes sense.

Regards,
  Felix


          uninit_queue(pqn->q);
      }

[Index of Archives]     [Linux USB Devel]     [Linux Audio Users]     [Yosemite News]     [Linux Kernel]     [Linux SCSI]

  Powered by Linux