On 2024-10-21 04:12, Christian König
wrote:
Am 18.10.24 um 23:59 schrieb Philip Yang:
On 2024-10-18 14:28, Felix Kuehling wrote:
On 2024-10-17 04:34, Victor Zhao wrote:
make sure KFD_FENCE_INIT write to fence_addr before pm_send_query_status
called, to avoid qcm fence timeout caused by incorrect ordering.
Signed-off-by: Victor Zhao <Victor.Zhao@xxxxxxx>
---
drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c | 1 +
drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.h | 2 +-
2 files changed, 2 insertions(+), 1 deletion(-)
diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c b/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c
index b2b16a812e73..d9264a353775 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c
@@ -2254,6 +2254,7 @@ static int unmap_queues_cpsch(struct device_queue_manager *dqm,
goto out;
*dqm->fence_addr = KFD_FENCE_INIT;
+ mb();
pm_send_query_status(&dqm->packet_mgr, dqm->fence_gpu_addr,
KFD_FENCE_COMPLETED);
/* should be timed out */
diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.h b/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.h
index 09ab36f8e8c6..bddb169bb301 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.h
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.h
@@ -260,7 +260,7 @@ struct device_queue_manager {
uint16_t vmid_pasid[VMID_NUM];
uint64_t pipelines_addr;
uint64_t fence_gpu_addr;
- uint64_t *fence_addr;
+ volatile uint64_t *fence_addr;
[+Christian]
Is the volatile keyword really needed here? I just saw other patches removing volatile in some places because it's not sufficient, and not needed if you use memory barriers correctly.
After reading kernel memory barrier document and below link, I think we need both volatile type and memory barrier, to guarantee F/W get the updated fence value. This fixes an CP hang issue on SRIOV.
https://stackoverflow.com/questions/75750110/volatile-vs-memory-barriers#:~:text=volatile%20will%20make%20sure%20that,not%20reorder%20writes%20or%20reads.
No, that isn't correct. Using volatile is considered harmful and almost never correct, see here https://docs.kernel.org/process/volatile-considered-harmful.html
Placing appropriate memory barriers must be sufficient or otherwise there is a rather bad platform or compiler bug lurking around.
Yes, Victor confirmed that memory barrier fixes the issue, will send new patch to remove the volatile type.
Regards,
Philip
Regards,
Christian.
Regards,
Philip
Regards,
Felix
struct kfd_mem_obj *fence_mem;
bool active_runlist;
int sched_policy;