Issue:
A deadlock occurs during GPU recovery. The call sequence is as follows:

amdgpu_device_gpu_recover -> amdgpu_amdkfd_pre_reset -> flush_delayed_work ->
amdgpu_amdkfd_gpuvm_restore_process_bos -> amdgpu_sync_wait

This happens because amdgpu_sync_wait waits for the bad job's fence and
never returns, so the recovery cannot continue.

Wait with a timeout instead: use the scheduler's timeout when the fence is
a scheduler fence, or 10 seconds otherwise, and return -ETIMEDOUT if the
wait expires.

Signed-off-by: Emily Deng <Emily.Deng@xxxxxxx>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_sync.c | 11 +++++++++--
 1 file changed, 9 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_sync.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_sync.c
index dcd8c066bc1f..9d4f122a7bf0 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_sync.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_sync.c
@@ -406,8 +406,15 @@ int amdgpu_sync_wait(struct amdgpu_sync *sync, bool intr)
 	int i, r;
 
 	hash_for_each_safe(sync->fences, i, tmp, e, node) {
-		r = dma_fence_wait(e->fence, intr);
-		if (r)
+		struct drm_sched_fence *s_fence = to_drm_sched_fence(e->fence);
+		long timeout = msecs_to_jiffies(10000);
+
+		if (s_fence)
+			timeout = s_fence->sched->timeout;
+		r = dma_fence_wait_timeout(e->fence, intr, timeout);
+		if (r == 0)
+			r = -ETIMEDOUT;
+		if (r < 0)
 			return r;
 
 		amdgpu_sync_entry_free(e);
-- 
2.36.1
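
For reviewers following the return-value handling, here is a minimal userspace sketch of the logic the hunk introduces. It assumes the usual convention for dma_fence_wait_timeout: a negative value on error, 0 when the wait timed out, and the remaining jiffies otherwise. The names `fake_sched`, `pick_timeout`, and `wait_result_to_errno` are hypothetical stand-ins for illustration only; they are not kernel APIs and not part of this patch.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in for the scheduler that owns a fence. */
struct fake_sched {
	long timeout;	/* scheduler job timeout, in jiffies */
};

/*
 * Choose the wait timeout: the owning scheduler's timeout when the fence
 * has one (sched != NULL), otherwise the caller-supplied default.
 */
static long pick_timeout(const struct fake_sched *sched, long default_timeout)
{
	assert(default_timeout > 0);
	return sched ? sched->timeout : default_timeout;
}

/*
 * Map the result of a timed wait to an errno-style code:
 *   r == 0  -> the wait timed out, report -ETIMEDOUT
 *   r <  0  -> propagate the error unchanged
 *   r >  0  -> the fence signaled with time to spare, report success (0)
 */
static int wait_result_to_errno(long r)
{
	if (r == 0)
		return -ETIMEDOUT;
	if (r < 0)
		return (int)r;
	return 0;
}
```

The explicit `r == 0` check matters because a timed wait reports a timeout as 0, which the old `if (r)` test would have treated as success.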