Re: [PATCH 2/2] drm/amdgpu: Add timeout for sync wait

On 20.10.23 at 11:59, Emily Deng wrote:
Issue: A deadlock happens during GPU recovery. The call sequence is as below:

amdgpu_device_gpu_recover->amdgpu_amdkfd_pre_reset->flush_delayed_work->
amdgpu_amdkfd_gpuvm_restore_process_bos->amdgpu_sync_wait

Resolving a deadlock with a timeout is illegal in general. So this patch here is an obvious no-go.

In addition to this problem, Xinhu already found that the delayed work is causing issues during suspend, because flushing doesn't guarantee that a new one isn't started right after the flush.

After talking with Felix about this, the correct solution is to stop flushing the delayed work and instead submit it to the freezable work queue.

Regards,
Christian.


This is because amdgpu_sync_wait is waiting for the bad job's fence and
never returns, so the recovery cannot continue.

Signed-off-by: Emily Deng <Emily.Deng@xxxxxxx>
---
  drivers/gpu/drm/amd/amdgpu/amdgpu_sync.c | 11 +++++++++--
  1 file changed, 9 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_sync.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_sync.c
index dcd8c066bc1f..9d4f122a7bf0 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_sync.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_sync.c
@@ -406,8 +406,15 @@ int amdgpu_sync_wait(struct amdgpu_sync *sync, bool intr)
  	int i, r;
  	hash_for_each_safe(sync->fences, i, tmp, e, node) {
-		r = dma_fence_wait(e->fence, intr);
-		if (r)
+		struct drm_sched_fence *s_fence = to_drm_sched_fence(e->fence);
+		long timeout = msecs_to_jiffies(10000);
+
+		if (s_fence)
+			timeout = s_fence->sched->timeout;
+		r = dma_fence_wait_timeout(e->fence, intr, timeout);
+		if (r == 0)
+			r = -ETIMEDOUT;
+		if (r < 0)
  			return r;
  		amdgpu_sync_entry_free(e);
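
For readers following the hunk above, here is a minimal userspace sketch of its return-value handling. It assumes the documented dma_fence_wait_timeout() convention: a positive value is the remaining timeout in jiffies, 0 means the wait timed out, and a negative value is an errno. The helper name sync_wait_result() is hypothetical, and -EINTR only stands in for -ERESTARTSYS, which userspace headers do not expose. It illustrates the mapping only and does not change the objection that a timeout cannot resolve a deadlock.

```c
#include <errno.h>

/*
 * Userspace model (not kernel code) of the checks added by the hunk.
 * 'r' plays the role of the dma_fence_wait_timeout() return value:
 *   r  > 0  fence signaled, r is the remaining timeout
 *   r == 0  the wait timed out
 *   r  < 0  negative errno (e.g. interrupted)
 * Returns 0 on success or a negative errno, which is what the loop in
 * amdgpu_sync_wait() would hand back to its caller.
 */
static long sync_wait_result(long r)
{
	if (r == 0)
		return -ETIMEDOUT;	/* turn a timeout into an explicit error */
	if (r < 0)
		return r;		/* propagate the error unchanged */
	return 0;			/* signaled in time, entry can be freed */
}
```

With this mapping, sync_wait_result(250) yields 0, sync_wait_result(0) yields -ETIMEDOUT, and sync_wait_result(-EINTR) passes -EINTR through unchanged.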



