Re: [RFC PATCH] drm/amdkfd: Run restore_workers on freezable WQs

Felix Kuehling <felix.kuehling@xxxxxxx> · Mon, 30 Oct 2023 13:38:16 -0400




On 2023-10-30 12:16, Christian König wrote:
@@ -1904,6 +1906,19 @@ kfd_process_gpuid_from_node(struct 
kfd_process *p, struct kfd_node *node,
      return -EINVAL;
  }
  +static void signal_eviction_fence(struct kfd_process *p)
+{
+    spin_lock(&p->ef_lock);
+    if (!p->ef)
+        goto unlock_out;
+    dma_fence_signal(p->ef);

This needs to grab the internal lock of the eviction fence, I'm not 
sure that has correct ordering with the newly added ef_lock.

Hmm, the only thing we could get a circular lock dependency would be, 
if we took the p->ef_lock in a fence callback. I don't think that 
ever happens, because even the eviction work runs on a worker thread 
(exactly to avoid such lock dependency issues).

Anyway, I could try to move the fence_signal out of the critical 
section. The lock is only there to ensure that exactly one process 
signals and frees the fence.


So basically either the eviction worker or the GPU reset could signal 
this fence.

In that case I think it would be simpler to grab the reset lock in the 
eviction worker to protect against a concurrent reset.

Which reset lock? adev->reset_cntl->reset_lock? I only see that lock 
taken in aldebaran_mode2_perform_reset. I don't understand why this is 
in ASIC-specific code. But even so, it's only taken during the actual 
reset (in aldebaran_mode2_perform_reset). I don't think it covers the 
pre-reset code path that signals the eviction fence.

Regards,
  Felix



Regards,
Christian.