[PATCH V7 5/9] drm/amdgpu: Update amdgpu_job_timedout to check if the ring is guilty

<jesse.zhang@xxxxxxx> · Thu, 13 Feb 2025 13:47:11 +0800

From: "Jesse.zhang@xxxxxxx" <Jesse.zhang@xxxxxxx>

This patch updates the `amdgpu_job_timedout` function to check if
the ring is actually guilty of causing the timeout. If not, it
skips error handling and fence completion.

Suggested-by: Alex Deucher <alexander.deucher@xxxxxxx>
Signed-off-by: Jesse Zhang <jesse.zhang@xxxxxxx>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_job.c | 10 ++++++++++
 1 file changed, 10 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
index 100f04475943..f94c876db72b 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
@@ -101,6 +101,16 @@ static enum drm_gpu_sched_stat amdgpu_job_timedout(struct drm_sched_job *s_job)
 		/* Effectively the job is aborted as the device is gone */
 		return DRM_GPU_SCHED_STAT_ENODEV;
 	}
+	/* Check if the ring is actually guilty of causing the timeout.
+	* If not, skip error handling and fence completion.
+	*/
+	if (amdgpu_gpu_recovery && ring->funcs->is_guilty) {
+		if (!ring->funcs->is_guilty(ring)) {
+			dev_err(adev->dev, "ring %s timeout, but not guilty\n",
+				s_job->sched->name);
+			goto exit;
+		}
+	}
 
 	/*
 	 * Do the coredump immediately after a job timeout to get a very
-- 
2.25.1