Re: [PATCH v3 2/2] drm/amdkfd: pause autosuspend when creating pdd

On 2024-12-04 18:36, Felix Kuehling wrote:

On 2024-12-03 09:30, Yunxiang Li wrote:
When using MES, creating a pdd requires talking to the GPU to set up
the relevant context. The code here forgot to wake up the GPU in case it
was suspended, which causes KVM to return EFAULT for a passthrough GPU,
for example. This issue can be masked if the GPU was woken up by
something else first (e.g. opening the KMS node) and has not yet gone
back to sleep.

Fixes: cc009e613de6 ("drm/amdkfd: Add KFD support for soc21 v3")
Signed-off-by: Yunxiang Li <Yunxiang.Li@xxxxxxx>
---
v3: remove the cleanup in kfd_bind_process_to_device and document why
this issue doesn't always happen

  drivers/gpu/drm/amd/amdkfd/kfd_process.c | 7 +++++++
  1 file changed, 7 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_process.c b/drivers/gpu/drm/amd/amdkfd/kfd_process.c
index 555a892fcf963..c81c020af75d1 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_process.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_process.c
@@ -1635,12 +1635,19 @@ struct kfd_process_device *kfd_create_process_device_data(struct kfd_node *dev,
      atomic64_set(&pdd->evict_duration_counter, 0);
        if (dev->kfd->shared_resources.enable_mes) {
+        retval = pm_runtime_resume_and_get(bdev);
+        if (retval < 0) {
+            pr_err("failed to stop autosuspend\n");
+            goto err_free_pdd;
+        }
          retval = amdgpu_amdkfd_alloc_gtt_mem(adev,
                          AMDGPU_MES_PROC_CTX_SIZE,
                          &pdd->proc_ctx_bo,
                          &pdd->proc_ctx_gpu_addr,
                          &pdd->proc_ctx_cpu_ptr,
                          false);

As far as I can see from grepping the code, this BO is never used. It is allocated here and freed in kfd_process_destroy_pdds, and that's it.

I see a different proc_ctx_bo allocation in amdgpu_mes_create_process, but I don't see that function being called anywhere. Either my grep-fu is getting rusty, or there is some dead code and dead data structures surrounding MES here.

So unless I'm missing something, we can just remove this proc_ctx_bo completely.

OK, I was missing that proc_ctx_gpu_addr is used in add_queue_mes.

Another option would be to allocate proc_ctx_bo lazily, when the first queue is created in a process. The allocation would then naturally happen after the GPU has been taken out of runtime PM in kfd_bind_process_to_device. This would avoid unnecessarily waking up all GPUs in the system every time a KFD process is started.
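
To make the lazy-allocation idea concrete, here is a minimal user-space sketch. It is not the amdkfd code: every mock_* name, the struct layouts, and MOCK_PROC_CTX_SIZE are hypothetical stand-ins, and calloc() only models the GTT allocation. It shows just the ordering: pdd creation never touches the GPU, and the context buffer is allocated once, on the first queue, after the bind step has woken the device.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

#define MOCK_PROC_CTX_SIZE 4096	/* hypothetical size, not AMDGPU_MES_PROC_CTX_SIZE */

/* Hypothetical stand-in for a GPU with runtime PM. */
struct mock_gpu {
	bool awake;	/* true while the device is held out of runtime suspend */
	int wakeups;	/* how many times the device had to be resumed */
};

/* Hypothetical stand-in for the per-process device data. */
struct mock_pdd {
	struct mock_gpu *gpu;
	bool bound;	/* models kfd_bind_process_to_device having run */
	void *proc_ctx;	/* lazily allocated process context buffer */
};

/* Creating the pdd only initializes state; the GPU stays asleep. */
static void mock_create_pdd(struct mock_pdd *pdd, struct mock_gpu *gpu)
{
	pdd->gpu = gpu;
	pdd->bound = false;
	pdd->proc_ctx = NULL;
}

/* Models the bind step: resume the GPU once and keep it awake. */
static void mock_bind(struct mock_pdd *pdd)
{
	if (pdd->bound)
		return;
	if (!pdd->gpu->awake) {
		pdd->gpu->awake = true;
		pdd->gpu->wakeups++;
	}
	pdd->bound = true;
}

/*
 * Models queue creation. The context buffer is allocated on the first
 * call only, and always after the bind step. Returns 0 on success and
 * -1 if the allocation fails.
 */
static int mock_create_queue(struct mock_pdd *pdd)
{
	mock_bind(pdd);
	if (!pdd->proc_ctx) {
		pdd->proc_ctx = calloc(1, MOCK_PROC_CTX_SIZE);
		if (!pdd->proc_ctx)
			return -1;
	}
	return 0;
}

/* Releases the lazily allocated buffer, if any. */
static void mock_destroy_pdd(struct mock_pdd *pdd)
{
	free(pdd->proc_ctx);
	pdd->proc_ctx = NULL;
}
```

In this model a process that never creates a queue on a given GPU never wakes it, which is the point of the suggestion. A real implementation would also need to handle concurrent first-queue creation, which the sketch ignores.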

Regards,
  Felix



+        pm_runtime_mark_last_busy(bdev);
+        pm_runtime_put_autosuspend(bdev);
          if (retval) {
              dev_err(bdev,
                  "failed to allocate process context bo\n");


