Re: [PATCH] drm/amdkfd: fix the hang caused by the write reorder to fence_addr

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 




On 2024-10-21 04:12, Christian König wrote:
Am 18.10.24 um 23:59 schrieb Philip Yang:
On 2024-10-18 14:28, Felix Kuehling wrote:

On 2024-10-17 04:34, Victor Zhao wrote:
make sure KFD_FENCE_INIT write to fence_addr before pm_send_query_status
called, to avoid qcm fence timeout caused by incorrect ordering.

Signed-off-by: Victor Zhao <Victor.Zhao@xxxxxxx>
---
  drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c | 1 +
  drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.h | 2 +-
  2 files changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c b/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c
index b2b16a812e73..d9264a353775 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c
@@ -2254,6 +2254,7 @@ static int unmap_queues_cpsch(struct device_queue_manager *dqm,
          goto out;
        *dqm->fence_addr = KFD_FENCE_INIT;
+    mb();
      pm_send_query_status(&dqm->packet_mgr, dqm->fence_gpu_addr,
                  KFD_FENCE_COMPLETED);
      /* should be timed out */
diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.h b/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.h
index 09ab36f8e8c6..bddb169bb301 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.h
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.h
@@ -260,7 +260,7 @@ struct device_queue_manager {
      uint16_t        vmid_pasid[VMID_NUM];
      uint64_t        pipelines_addr;
      uint64_t        fence_gpu_addr;
-    uint64_t        *fence_addr;
+    volatile uint64_t    *fence_addr;

[+Christian]

Is the volatile keyword really needed here? I just saw other patches removing volatile in some places because it's not sufficient, and not needed if you use memory barriers correctly.

After reading kernel memory barrier document and below link, I think we need both volatile type and memory barrier, to guarantee F/W get the updated fence value. This fixes an CP hang issue on SRIOV.

https://stackoverflow.com/questions/75750110/volatile-vs-memory-barriers#:~:text=volatile%20will%20make%20sure%20that,not%20reorder%20writes%20or%20reads.


No, that isn't correct. Using volatile is considered harmful and almost never correct, see here https://docs.kernel.org/process/volatile-considered-harmful.html

Placing appropriate memory barriers must be sufficient or otherwise there is a rather bad platform or compiler bug lurking around.

Yes, Victor confirmed that memory barrier fixes the issue, will send new patch to remove the volatile type.

Regards,

Philip


Regards,
Christian.

Regards,

Philip


Regards,
  Felix


      struct kfd_mem_obj    *fence_mem;
      bool            active_runlist;
      int            sched_policy;


[Index of Archives]     [Linux USB Devel]     [Linux Audio Users]     [Yosemite News]     [Linux Kernel]     [Linux SCSI]

  Powered by Linux