Re: [RFC PATCH] drm/amdkfd: Run restore_workers on freezable WQs

Christian König <christian.koenig@xxxxxxx> · Mon, 30 Oct 2023 19:00:49 +0100




Am 30.10.23 um 18:56 schrieb Felix Kuehling:
On 2023-10-30 13:48, Christian König wrote:


Am 30.10.23 um 18:38 schrieb Felix Kuehling:
On 2023-10-30 12:16, Christian König wrote:
@@ -1904,6 +1906,19 @@ kfd_process_gpuid_from_node(struct 
kfd_process *p, struct kfd_node *node,
      return -EINVAL;
  }
  +static void signal_eviction_fence(struct kfd_process *p)
+{
+    spin_lock(&p->ef_lock);
+    if (!p->ef)
+        goto unlock_out;
+    dma_fence_signal(p->ef);

This needs to grab the internal lock of the eviction fence, I'm 
not sure that has correct ordering with the newly added ef_lock.

Hmm, the only thing we could get a circular lock dependency would 
be, if we took the p->ef_lock in a fence callback. I don't think 
that ever happens, because even the eviction work runs on a worker 
thread (exactly to avoid such lock dependency issues).

Anyway, I could try to move the fence_signal out of the critical 
section. The lock is only there to ensure that exactly one process 
signals and frees the fence.


So basically either the eviction worker or the GPU reset could 
signal this fence.

In that case I think it would be simpler to grab the reset lock in 
the eviction worker to protect against a concurrent reset.

Which reset lock? adev->reset_cntl->reset_lock? I only see that lock 
taken in aldebaran_mode2_perform_reset. I don't understand why this 
is in ASIC-specific code. But even so, it's only taken during the 
actual reset (in aldebaran_mode2_perform_reset). I don't think it 
covers the pre-reset code path that signals the eviction fence.

No, what I mean is adev->reset_domain->sem. It's hold in write lock 
during the reset and you can grab the read side if you need to delay 
a reset.

But thinking about that a bit more, you actually don't need any of 
this. Something like this here should do fine:

tmp = dma_fence_get_rcu_safe(&p->ef);
dma_fence_signal(tmp);
dma_fence_put(tmp);

dma_fence_get_rcu_safe gets a new reference that dma_fence_put drops. 
It doesn't drop the original reference in p->ef.

I need a way to ensure that exactly one thread frees the original 
reference in p->ef. Even with RCU, concurrent writers still need a 
lock or a mutex. So I don't think I can avoid using another lock here 
because I have potentially two concurrent writers.

Mhm, why would you need that? Isn't p->ef be replaced with a new fence 
when the process is restored?

Or do you go from fence->NULL->fence? If that's the case then yeah the 
lock is probably your only option.

Regards,
Christian.


Regards,
  Felix



dma_fences are always RCU protected and can be signaled from multiple 
sources by design.

Regards,
Christian.


Regards,
  Felix



Regards,
Christian.