On 2024-10-18 14:28, Felix Kuehling
wrote:
On 2024-10-17 04:34, Victor Zhao wrote:
make sure KFD_FENCE_INIT write to fence_addr before pm_send_query_status
called, to avoid qcm fence timeout caused by incorrect ordering.
Signed-off-by: Victor Zhao <Victor.Zhao@xxxxxxx>
---
drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c | 1 +
drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.h | 2 +-
2 files changed, 2 insertions(+), 1 deletion(-)
diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c b/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c
index b2b16a812e73..d9264a353775 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c
@@ -2254,6 +2254,7 @@ static int unmap_queues_cpsch(struct device_queue_manager *dqm,
goto out;
*dqm->fence_addr = KFD_FENCE_INIT;
+ mb();
pm_send_query_status(&dqm->packet_mgr, dqm->fence_gpu_addr,
KFD_FENCE_COMPLETED);
/* should be timed out */
diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.h b/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.h
index 09ab36f8e8c6..bddb169bb301 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.h
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.h
@@ -260,7 +260,7 @@ struct device_queue_manager {
uint16_t vmid_pasid[VMID_NUM];
uint64_t pipelines_addr;
uint64_t fence_gpu_addr;
- uint64_t *fence_addr;
+ volatile uint64_t *fence_addr;
[+Christian]
Is the volatile keyword really needed here? I just saw other patches removing volatile in some places because it's not sufficient, and not needed if you use memory barriers correctly.
After reading kernel memory barrier document and below link, I
think we need both volatile type and memory barrier, to guarantee
F/W get the updated fence value. This fixes an CP hang issue on
SRIOV.
Regards,
Philip
Regards,
Felix
struct kfd_mem_obj *fence_mem;
bool active_runlist;
int sched_policy;