Re: [PATCH] drm/amdgpu: workaround for TLB seq race

Christian König <ckoenig.leichtzumerken@xxxxxxxxx> · Fri, 4 Nov 2022 08:10:10 +0100

On 03.11.22 at 22:18, Philip Yang wrote:

On 2022-11-02 10:58, Christian König wrote:
It can happen that we query the sequence value before the callback
had a chance to run.

Work around that by grabbing the fence lock and releasing it again.
Should be replaced by hw handling soon.

kfd_flush_tlb is always called after waiting for the map/unmap to GPU
fence to signal, which means the callback has already been executed

And that is exactly what's incorrect.

Waiting for the fence to signal means that the callback has started 
executing, but it doesn't mean that it is finished.

This can then result in one CPU racing with the callback handler and 
because of this you see the wrong TLB seq.
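
To illustrate the ordering, here is a minimal userspace sketch using pthreads instead of the kernel primitives. It assumes, as the workaround implies, that the signaling path marks the fence as signaled and then runs the callback while still holding the fence lock. The names demo_fence, signal_thread, demo_tlb_seq and demo_read_after_signal are hypothetical and exist only for this example; this is not the amdgpu code.

```c
#include <assert.h>
#include <pthread.h>
#include <sched.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for a fence; not the kernel's struct dma_fence. */
struct demo_fence {
	pthread_mutex_t lock;
	atomic_bool signaled;
};

static struct demo_fence fence = { .lock = PTHREAD_MUTEX_INITIALIZER };
static atomic_int_fast64_t tlb_seq;

/* Signaling side: set the flag first, run the "callback" afterwards,
 * all of it while holding the fence lock (assumed ordering). */
static void *signal_thread(void *arg)
{
	struct demo_fence *f = arg;

	pthread_mutex_lock(&f->lock);
	atomic_store(&f->signaled, true);	/* waiters can wake up from here on */
	sched_yield();				/* widen the race window */
	atomic_fetch_add(&tlb_seq, 1);		/* the callback bumping the sequence */
	pthread_mutex_unlock(&f->lock);
	return NULL;
}

/* Reader side: taking and dropping the lock waits for a callback that is
 * still running under that lock before the sequence is read. */
static int64_t demo_tlb_seq(struct demo_fence *f)
{
	pthread_mutex_lock(&f->lock);
	pthread_mutex_unlock(&f->lock);
	return atomic_load(&tlb_seq);
}

/* Wait for the signaled flag, then query the sequence number. */
int64_t demo_read_after_signal(void)
{
	pthread_t t;
	int64_t seq;
	int ret;

	atomic_store(&fence.signaled, false);
	atomic_store(&tlb_seq, 0);

	ret = pthread_create(&t, NULL, signal_thread, &fence);
	assert(ret == 0);

	while (!atomic_load(&fence.signaled))
		sched_yield();

	seq = demo_tlb_seq(&fence);
	pthread_join(t, NULL);
	return seq;
}
```

Calling demo_read_after_signal() from a main() returns the incremented sequence number. Without the lock/unlock pair in demo_tlb_seq() the reader could return the old value, because it only waited for the flag and not for the callback to finish.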

Regards,
Christian.

and the sequence is increased if tlb flush is needed, so no such race 
from KFD.

I am not sure, but it seems the race does exist when amdgpu grabs the vm
and schedules a job.

Acked-by: Philip Yang <Philip.Yang@xxxxxxx>

Signed-off-by: Christian König <christian.koenig@xxxxxxx>
---
  drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h | 15 +++++++++++++++
  1 file changed, 15 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h
index 9ecb7f663e19..e51a46c9582b 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h
@@ -485,6 +485,21 @@ void amdgpu_debugfs_vm_bo_info(struct amdgpu_vm *vm, struct seq_file *m);
   */
  static inline uint64_t amdgpu_vm_tlb_seq(struct amdgpu_vm *vm)
  {
+    unsigned long flags;
+    spinlock_t *lock;
+
+    /*
+     * Work around to stop racing between the fence signaling and handling
+     * the cb. The lock is static after initially setting it up, just make
+     * sure that the dma_fence structure isn't freed up.
+     */
+    rcu_read_lock();
+    lock = vm->last_tlb_flush->lock;
+    rcu_read_unlock();
+
+    spin_lock_irqsave(lock, flags);
+    spin_unlock_irqrestore(lock, flags);
+
      return atomic64_read(&vm->tlb_seq);
  }