On 20.10.23 at 11:59, Emily Deng wrote:
Issue: A deadlock happens during GPU recovery. The call sequence is as below:
amdgpu_device_gpu_recover->amdgpu_amdkfd_pre_reset->flush_delayed_work->
amdgpu_amdkfd_gpuvm_restore_process_bos->amdgpu_sync_wait
Resolving a deadlock with a timeout is illegal in general. So this patch
here is an obvious no-go.
In addition to this problem, Xinhu already found that the delayed
work causes issues during suspend, because flushing doesn't
guarantee that a new one isn't started right after the flush.
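The flush problem can be illustrated with a small userspace toy model. This is only a sketch under loud assumptions: `toy_work`, `toy_queue`, `toy_flush` and `restore_fn` are hypothetical names invented for illustration and are not the kernel workqueue API, and the handler is assumed to re-arm itself (for example because a restore has to be retried). It only shows that a flush waits for one execution and says nothing about work queued afterwards.

```c
#include <stdbool.h>

/* Hypothetical toy model of a delayed work item; NOT the kernel API. */
struct toy_work {
	bool pending;                     /* queued and waiting to run */
	int runs;                         /* how often the handler has run */
	void (*fn)(struct toy_work *w);   /* handler */
};

/* Mark the work as queued. */
static void toy_queue(struct toy_work *w)
{
	w->pending = true;
}

/* Run the work once if it is pending, then return (models a flush). */
static void toy_flush(struct toy_work *w)
{
	if (w->pending) {
		w->pending = false;
		w->fn(w);
	}
}

/* Assumed handler behaviour: it queues itself again for a retry. */
static void restore_fn(struct toy_work *w)
{
	w->runs++;
	toy_queue(w);
}

/*
 * Queue the work, flush it once, and report whether it is pending again.
 * The number of handler runs is stored in *runs_out.
 */
bool pending_after_flush(int *runs_out)
{
	struct toy_work w = { .pending = false, .runs = 0, .fn = restore_fn };

	toy_queue(&w);
	toy_flush(&w);
	*runs_out = w.runs;
	return w.pending;
}
```

In this model the handler runs exactly once during the flush and the work is pending again when the flush returns, so code that runs after the flush cannot assume the work is idle.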
After talking with Felix about this, the correct solution is to stop
flushing the delayed work and instead submit it to the freezable
work queue.
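A minimal kernel-context sketch of what submitting to the freezable work queue could look like. It is an illustration, not the actual fix: `example_proc`, `example_restore_worker`, `example_proc_init` and `example_schedule_restore` are hypothetical names, and the fragment builds only inside a kernel tree. The workqueue calls themselves (`INIT_DELAYED_WORK`, `queue_delayed_work`, `system_freezable_wq`, `to_delayed_work`) are the regular kernel workqueue API.

```c
#include <linux/kernel.h>
#include <linux/jiffies.h>
#include <linux/workqueue.h>

/* Hypothetical container for the delayed restore work. */
struct example_proc {
	struct delayed_work restore_work;
};

/* Hypothetical worker; the real restore logic would go here. */
static void example_restore_worker(struct work_struct *work)
{
	struct delayed_work *dwork = to_delayed_work(work);
	struct example_proc *p =
		container_of(dwork, struct example_proc, restore_work);

	/* ... restore the buffer objects belonging to p ... */
	(void)p;
}

static void example_proc_init(struct example_proc *p)
{
	INIT_DELAYED_WORK(&p->restore_work, example_restore_worker);
}

static void example_schedule_restore(struct example_proc *p,
				     unsigned int delay_ms)
{
	/*
	 * Work on the freezable workqueue is not executed while the
	 * system is frozen for suspend, so nothing has to be flushed.
	 */
	queue_delayed_work(system_freezable_wq, &p->restore_work,
			   msecs_to_jiffies(delay_ms));
}
```

The design point is that the workqueue itself holds the work back during suspend, instead of the caller trying to drain work that may be queued again right after the flush.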
Regards,
Christian.
This is because amdgpu_sync_wait is waiting for the bad job's fence and
never returns, so the recovery cannot continue.
Signed-off-by: Emily Deng <Emily.Deng@xxxxxxx>
---
drivers/gpu/drm/amd/amdgpu/amdgpu_sync.c | 11 +++++++++--
1 file changed, 9 insertions(+), 2 deletions(-)
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_sync.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_sync.c
index dcd8c066bc1f..9d4f122a7bf0 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_sync.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_sync.c
@@ -406,8 +406,15 @@ int amdgpu_sync_wait(struct amdgpu_sync *sync, bool intr)
 	int i, r;
 	hash_for_each_safe(sync->fences, i, tmp, e, node) {
-		r = dma_fence_wait(e->fence, intr);
-		if (r)
+		struct drm_sched_fence *s_fence = to_drm_sched_fence(e->fence);
+		long timeout = msecs_to_jiffies(10000);
+
+		if (s_fence)
+			timeout = s_fence->sched->timeout;
+		r = dma_fence_wait_timeout(e->fence, intr, timeout);
+		if (r == 0)
+			r = -ETIMEDOUT;
+		if (r < 0)
 			return r;
 		amdgpu_sync_entry_free(e);