From: "Jesse.zhang@xxxxxxx" <Jesse.zhang@xxxxxxx>

This patch updates the `amdgpu_job_timedout` function to check if
the ring is actually guilty of causing the timeout. If not, it skips
error handling and fence completion.

Suggested-by: Alex Deucher <alexander.deucher@xxxxxxx>
Signed-off-by: Jesse Zhang <jesse.zhang@xxxxxxx>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_job.c | 10 ++++++++++
 1 file changed, 10 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
index 100f04475943..f94c876db72b 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
@@ -101,6 +101,16 @@ static enum drm_gpu_sched_stat amdgpu_job_timedout(struct drm_sched_job *s_job)
 		/* Effectively the job is aborted as the device is gone */
 		return DRM_GPU_SCHED_STAT_ENODEV;
 	}
+	/* Check if the ring is actually guilty of causing the timeout.
+	 * If not, skip error handling and fence completion.
+	 */
+	if (amdgpu_gpu_recovery && ring->funcs->is_guilty) {
+		if (!ring->funcs->is_guilty(ring)) {
+			dev_err(adev->dev, "ring %s timeout, but not guilty\n",
+				s_job->sched->name);
+			goto exit;
+		}
+	}
 
 	/*
 	 * Do the coredump immediately after a job timeout to get a very
-- 
2.25.1