Am 18.10.24 um 23:59 schrieb Philip Yang:
On 2024-10-18 14:28, Felix Kuehling wrote:
On 2024-10-17 04:34, Victor Zhao wrote:
make sure KFD_FENCE_INIT write to fence_addr before
pm_send_query_status
called, to avoid qcm fence timeout caused by incorrect ordering.
Signed-off-by: Victor Zhao <Victor.Zhao@xxxxxxx>
---
drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c | 1 +
drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.h | 2 +-
2 files changed, 2 insertions(+), 1 deletion(-)
diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c
b/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c
index b2b16a812e73..d9264a353775 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c
@@ -2254,6 +2254,7 @@ static int unmap_queues_cpsch(struct
device_queue_manager *dqm,
goto out;
*dqm->fence_addr = KFD_FENCE_INIT;
+ mb();
pm_send_query_status(&dqm->packet_mgr, dqm->fence_gpu_addr,
KFD_FENCE_COMPLETED);
/* should be timed out */
diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.h
b/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.h
index 09ab36f8e8c6..bddb169bb301 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.h
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.h
@@ -260,7 +260,7 @@ struct device_queue_manager {
uint16_t vmid_pasid[VMID_NUM];
uint64_t pipelines_addr;
uint64_t fence_gpu_addr;
- uint64_t *fence_addr;
+ volatile uint64_t *fence_addr;
[+Christian]
Is the volatile keyword really needed here? I just saw other patches
removing volatile in some places because it's not sufficient, and not
needed if you use memory barriers correctly.
After reading kernel memory barrier document and below link, I think
we need both volatile type and memory barrier, to guarantee F/W get
the updated fence value. This fixes an CP hang issue on SRIOV.
https://stackoverflow.com/questions/75750110/volatile-vs-memory-barriers#:~:text=volatile%20will%20make%20sure%20that,not%20reorder%20writes%20or%20reads.
No, that isn't correct. Using volatile is considered harmful and almost
never correct, see here
https://docs.kernel.org/process/volatile-considered-harmful.html
Placing appropriate memory barriers must be sufficient or otherwise
there is a rather bad platform or compiler bug lurking around.
Regards,
Christian.
Regards,
Philip
Regards,
Felix
struct kfd_mem_obj *fence_mem;
bool active_runlist;
int sched_policy;