Re: [PATCH 1/2] drm/amdgpu: Add timeout for sync wait

Felix Kuehling <felix.kuehling@xxxxxxx> · Fri, 20 Oct 2023 15:47:07 -0400

On 2023-10-20 09:10, Christian König wrote:
No, the wait forever is what is expected and perfectly valid user 
experience.

Waiting with a timeout on the other hand sounds like a really bad idea 
to me.

Every wait with a timeout needs a justification, e.g. for example that 
userspace explicitly specified it. And I absolutely don't see that here.
In this case the wait is in a kernel worker thread, and the wait is not 
interruptible. Not having a timeout means, you can have a kernel worker 
stuck forever. The restore worker also has retry logic already, so it 
can handle a timeout perfectly well. But maybe this shouldn't be done 
automatically for all callers of amdgpu_sync_wait, but only for this 
particular caller in the restore_process_worker. So we'd need to add a 
timeout parameter to amdgpu_sync_wait.

Regards,
  Felix



Regards,
Christian.

Am 20.10.23 um 10:52 schrieb Deng, Emily:
[AMD Official Use Only - General]

Hi Christian,
      The issue is running a compute hang with a quark and trigger a 
compute job timeout. For compute, the timeout setting is 60s, but for 
gfx and sdma, it is 10s.
So, get the timeout from the sched is reasonable for different sched.
     And if wait timeout, it will print error, so won't hint real 
issues. And even it has real issue, the wait forever is bad user 
experience, and driver couldn't work anymore.

Emily Deng
Best Wishes



-----Original Message-----
From: Christian König <ckoenig.leichtzumerken@xxxxxxxxx>
Sent: Friday, October 20, 2023 3:29 PM
To: Deng, Emily <Emily.Deng@xxxxxxx>; amd-gfx@xxxxxxxxxxxxxxxxxxxxx
Subject: Re: [PATCH 1/2] drm/amdgpu: Add timeout for sync wait

Am 20.10.23 um 08:13 schrieb Emily Deng:
Issue: Dead heappen during gpu recover, the call sequence as below:

amdgpu_device_gpu_recover->amdgpu_amdkfd_pre_reset-
flush_delayed_work
-> amdgpu_amdkfd_gpuvm_restore_process_bos->amdgpu_sync_wait

It is because the amdgpu_sync_wait is waiting for the bad job's fence,
and never return, so the recover couldn't continue.


Signed-off-by: Emily Deng <Emily.Deng@xxxxxxx>
---
   drivers/gpu/drm/amd/amdgpu/amdgpu_sync.c | 11 +++++++++--
   1 file changed, 9 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_sync.c
b/drivers/gpu/drm/amd/amdgpu/amdgpu_sync.c
index dcd8c066bc1f..6253d6aab7f8 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_sync.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_sync.c
@@ -406,8 +406,15 @@ int amdgpu_sync_wait(struct amdgpu_sync *sync,
bool intr)
      int i, r;

      hash_for_each_safe(sync->fences, i, tmp, e, node) {
-            r = dma_fence_wait(e->fence, intr);
-            if (r)
+            struct drm_sched_fence *s_fence = to_drm_sched_fence(e-
fence);
+            long timeout = msecs_to_jiffies(10000);
That handling doesn't make much sense. If you need a timeout then 
you need
a timeout for the whole function.

Additional to that timeouts often just hide real problems which 
needs fixing.

So this here needs a much better justification otherwise it's a 
pretty clear NAK.

Regards,
Christian.

+
+            if (s_fence)
+                    timeout = s_fence->sched->timeout;
+
+            if (r == 0)
+                    r = -ETIMEDOUT;
+            if (r < 0)
                      return r;

              amdgpu_sync_entry_free(e);