On 03.11.22 at 22:18, Philip Yang wrote:
> On 2022-11-02 10:58, Christian König wrote:
>> It can happen that we query the sequence value before the callback
>> had a chance to run.
>>
>> Work around that by grabbing the fence lock and releasing it again.
>> Should be replaced by hw handling soon.
>
> kfd_flush_tlb is always called after waiting for the map/unmap-to-GPU
> fence to signal, which means the callback has already executed

And exactly that is incorrect.

Waiting for the fence to signal means that the callback has started
executing, but it doesn't mean that it has finished.

This can then result in one CPU racing with the callback handler, and
because of that it reads a stale TLB sequence number.
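
For reference, a simplified sketch of the signaling path, paraphrasing
dma_fence_signal() from drivers/dma-buf/dma-fence.c (timestamp handling
and list bookkeeping are omitted, so this is not the verbatim kernel
code):

#include <linux/dma-fence.h>
#include <linux/spinlock.h>

int dma_fence_signal(struct dma_fence *fence)
{
        struct dma_fence_cb *cur, *tmp;
        unsigned long flags;

        spin_lock_irqsave(fence->lock, flags);

        /* Step 1: the fence becomes visible as signaled ... */
        if (test_and_set_bit(DMA_FENCE_FLAG_SIGNALED_BIT, &fence->flags)) {
                spin_unlock_irqrestore(fence->lock, flags);
                return -EINVAL;
        }

        /*
         * Step 2: ... *before* the registered callbacks have run.  A
         * waiter on another CPU can already observe the fence as
         * signaled here, while the callback that increments tlb_seq is
         * still executing below, under fence->lock.
         */
        list_for_each_entry_safe(cur, tmp, &fence->cb_list, node) {
                list_del_init(&cur->node);
                cur->func(fence, cur);
        }

        spin_unlock_irqrestore(fence->lock, flags);
        return 0;
}
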
Regards,
Christian.

> and the sequence is increased if a TLB flush is needed, so there is no
> such race from KFD.
>
> I am not sure, but it seems the race does exist for amdgpu when it
> grabs the vm and schedules a job.
>
> Acked-by: Philip Yang <Philip.Yang@xxxxxxx>
>
>> Signed-off-by: Christian König <christian.koenig@xxxxxxx>
>> ---
>>  drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h | 15 +++++++++++++++
>>  1 file changed, 15 insertions(+)
>>
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h
>> index 9ecb7f663e19..e51a46c9582b 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h
>> @@ -485,6 +485,21 @@ void amdgpu_debugfs_vm_bo_info(struct amdgpu_vm *vm, struct seq_file *m);
>>   */
>>  static inline uint64_t amdgpu_vm_tlb_seq(struct amdgpu_vm *vm)
>>  {
>> +        unsigned long flags;
>> +        spinlock_t *lock;
>> +
>> +        /*
>> +         * Workaround to stop racing between the fence signaling and handling
>> +         * the cb. The lock is static after initially setting it up, just make
>> +         * sure that the dma_fence structure isn't freed up.
>> +         */
>> +        rcu_read_lock();
>> +        lock = vm->last_tlb_flush->lock;
>> +        rcu_read_unlock();
>> +
>> +        spin_lock_irqsave(lock, flags);
>> +        spin_unlock_irqrestore(lock, flags);
>> +
>>          return atomic64_read(&vm->tlb_seq);
>>  }
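
To make the fix concrete: the sequence number is incremented from a
dma_fence callback, and dma_fence_signal() invokes callbacks while
holding fence->lock.  Below is a sketch of the two sides of the race;
the names tlb_seq_cb_func() and tlb_flush_needed() are hypothetical and
for illustration only (the real callback is amdgpu_vm_tlb_seq_cb() in
amdgpu_vm.c, and the consumer loosely follows the pattern used by
kfd_flush_tlb() of caching the last seen sequence number):

#include <linux/dma-fence.h>
#include <linux/slab.h>
#include "amdgpu_vm.h"

/* Hypothetical container, along the lines of the one amdgpu uses. */
struct tlb_seq_cb {
        struct amdgpu_vm *vm;
        struct dma_fence_cb cb;
};

/* Invoked from dma_fence_signal(), i.e. while fence->lock is held. */
static void tlb_seq_cb_func(struct dma_fence *fence, struct dma_fence_cb *cb)
{
        struct tlb_seq_cb *tlb_cb = container_of(cb, struct tlb_seq_cb, cb);

        atomic64_inc(&tlb_cb->vm->tlb_seq);
        kfree(tlb_cb);
}

/*
 * Hypothetical consumer.  Because amdgpu_vm_tlb_seq() now takes and
 * drops fence->lock before reading, it cannot return while
 * tlb_seq_cb_func() is still running; without that, the sequence could
 * be read one step too early and a needed TLB flush would be skipped.
 */
static bool tlb_flush_needed(struct amdgpu_vm *vm, uint64_t *last_seq)
{
        uint64_t seq = amdgpu_vm_tlb_seq(vm);

        if (seq == *last_seq)
                return false;
        *last_seq = seq;
        return true;
}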