Re: [PATCH 1/2] drm/amdgpu: Add timeout for sync wait

Christian König <ckoenig.leichtzumerken@xxxxxxxxx> · Fri, 20 Oct 2023 09:29:06 +0200

On 20.10.23 at 08:13, Emily Deng wrote:
Issue: A deadlock happens during GPU recovery. The call sequence is as follows:

amdgpu_device_gpu_recover->amdgpu_amdkfd_pre_reset->flush_delayed_work->
amdgpu_amdkfd_gpuvm_restore_process_bos->amdgpu_sync_wait

This is because amdgpu_sync_wait is waiting for the bad job's fence and
never returns, so the recovery cannot continue.


Signed-off-by: Emily Deng <Emily.Deng@xxxxxxx>
---
  drivers/gpu/drm/amd/amdgpu/amdgpu_sync.c | 11 +++++++++--
  1 file changed, 9 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_sync.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_sync.c
index dcd8c066bc1f..6253d6aab7f8 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_sync.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_sync.c
@@ -406,8 +406,15 @@ int amdgpu_sync_wait(struct amdgpu_sync *sync, bool intr)
  	int i, r;
  
  	hash_for_each_safe(sync->fences, i, tmp, e, node) {
-		r = dma_fence_wait(e->fence, intr);
-		if (r)
+		struct drm_sched_fence *s_fence = to_drm_sched_fence(e->fence);
+		long timeout = msecs_to_jiffies(10000);

That handling doesn't make much sense. If you need a timeout, then you
need a timeout for the whole function.

In addition, timeouts often just hide real problems that need fixing.

So this needs a much better justification; otherwise it's a pretty
clear NAK.
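
To illustrate what a timeout for the whole function could look like, here is
a minimal userspace sketch, not driver code. It relies on the return
convention of dma_fence_wait_timeout() (remaining timeout on success, 0 on
timeout, negative on error) and feeds the remainder into the next wait, so
all fences share one budget. The names fake_fence, fake_fence_wait_timeout
and fake_sync_wait are hypothetical stand-ins made up for this example.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in for a fence: it "signals" after `cost` ticks. */
struct fake_fence {
	long cost;	/* ticks until signaled; negative means it never signals */
};

/*
 * Assumed to mimic the dma_fence_wait_timeout() return convention:
 * remaining ticks (> 0) on success, 0 if the wait timed out.
 * This stub never returns an error.
 */
static long fake_fence_wait_timeout(const struct fake_fence *f, long timeout)
{
	if (f->cost < 0 || f->cost >= timeout)
		return 0;
	return timeout - f->cost;
}

/* Wait for all fences using a single budget for the whole function. */
static int fake_sync_wait(const struct fake_fence *fences, size_t n,
			  long timeout)
{
	size_t i;

	for (i = 0; i < n; i++) {
		/* The remainder of this wait becomes the budget of the next. */
		timeout = fake_fence_wait_timeout(&fences[i], timeout);
		if (timeout == 0)
			return -ETIMEDOUT;
		if (timeout < 0)
			return (int)timeout;	/* error path of a real wait */
	}
	return 0;
}
```

With a budget of 10 ticks, two fences that each need 6 ticks make
fake_sync_wait() return -ETIMEDOUT, while a per-fence timeout of 10 would
let both waits pass and the function would block for 12 ticks in total.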

Regards,
Christian.

+
+		if (s_fence)
+			timeout = s_fence->sched->timeout;
+
+		if (r == 0)
+			r = -ETIMEDOUT;
+		if (r < 0)
  			return r;
  
  		amdgpu_sync_entry_free(e);