RE: [PATCH V7 5/9] drm/amdgpu: Update amdgpu_job_timedout to check if the ring is guilty

[Public]

> -----Original Message-----
> From: jesse.zhang@xxxxxxx <jesse.zhang@xxxxxxx>
> Sent: Thursday, February 13, 2025 12:47 AM
> To: amd-gfx@xxxxxxxxxxxxxxxxxxxxx
> Cc: Deucher, Alexander <Alexander.Deucher@xxxxxxx>; Kuehling, Felix
> <Felix.Kuehling@xxxxxxx>; Kim, Jonathan <Jonathan.Kim@xxxxxxx>; Zhu,
> Jiadong <Jiadong.Zhu@xxxxxxx>; Zhang, Jesse(Jie) <Jesse.Zhang@xxxxxxx>;
> Deucher, Alexander <Alexander.Deucher@xxxxxxx>; Zhang, Jesse(Jie)
> <Jesse.Zhang@xxxxxxx>
> Subject: [PATCH V7 5/9] drm/amdgpu: Update amdgpu_job_timedout to check if the
> ring is guilty
>
> From: "Jesse.zhang@xxxxxxx" <Jesse.zhang@xxxxxxx>
>
> This patch updates the `amdgpu_job_timedout` function to check if the ring is
> actually guilty of causing the timeout. If not, it skips error handling and fence
> completion.
>

Thinking about this more, I'm not sure this is the right approach.  If we detect a timeout on a kernel ring, we still want to do the reset; we just don't want to kill the job if it's not guilty.  Skipping the reset for a non-guilty ring makes sense when all rings are kernel rings, since we'll eventually get a timeout on the guilty ring and the reset will be triggered from there.  But if the hang is on a KFD queue, it won't get noticed until we attempt to preempt the user queues for some other reason, which may take a while.  How about the following instead: we move the is_guilty check down into the queue reset path.  Something like this:

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
index 100f044759435..48350d1030612 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
@@ -130,8 +130,6 @@ static enum drm_gpu_sched_stat amdgpu_job_timedout(struct drm_sched_job *s_job)
                amdgpu_vm_put_task_info(ti);
        }

-       dma_fence_set_error(&s_job->s_fence->finished, -ETIME);
-
        /* attempt a per ring reset */
        if (amdgpu_gpu_recovery &&
            ring->funcs->reset) {
@@ -146,13 +144,22 @@ static enum drm_gpu_sched_stat amdgpu_job_timedout(struct drm_sched_job *s_job)
                        if (amdgpu_ring_sched_ready(ring))
                                drm_sched_stop(&ring->sched, s_job);
                        atomic_inc(&ring->adev->gpu_reset_counter);
-                       amdgpu_fence_driver_force_completion(ring);
+                       if (ring->funcs->is_guilty) {
+                               if (ring->funcs->is_guilty(ring)) {
+                                       dma_fence_set_error(&s_job->s_fence->finished, -ETIME);
+                                       amdgpu_fence_driver_force_completion(ring);
+                               }
+                       } else {
+                               dma_fence_set_error(&s_job->s_fence->finished, -ETIME);
+                               amdgpu_fence_driver_force_completion(ring);
+                       }
                        if (amdgpu_ring_sched_ready(ring))
                                drm_sched_start(&ring->sched, 0);
                        goto exit;
                }
                dev_err(adev->dev, "Ring %s reset failure\n", ring->sched.name);
        }
+       dma_fence_set_error(&s_job->s_fence->finished, -ETIME);

        if (amdgpu_device_should_recover_gpu(ring->adev)) {
                struct amdgpu_reset_context reset_context;
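
To make the discussion concrete, here is a rough sketch of what a ring's is_guilty callback could look like.  The helper names (my_ring_seq_emitted(), my_ring_seq_completed()) and the function itself are hypothetical placeholders for illustration, not existing amdgpu code; a real backend would check its own queue state (e.g. RPTR progress or hang status registers):

#include "amdgpu.h"

/*
 * Illustrative sketch only: my_ring_seq_emitted()/my_ring_seq_completed()
 * are hypothetical placeholders, not existing amdgpu helpers.  The idea
 * is that a backend reports guilt by checking whether its queue has
 * stopped making forward progress.
 */
static bool my_ring_is_guilty(struct amdgpu_ring *ring)
{
	u64 emitted = my_ring_seq_emitted(ring);	/* last fence emitted */
	u64 completed = my_ring_seq_completed(ring);	/* last fence signaled */

	/* If everything emitted has already signaled, this ring did not hang. */
	return completed < emitted;
}

static const struct amdgpu_ring_funcs my_ring_funcs = {
	/* ... other callbacks ... */
	.is_guilty = my_ring_is_guilty,
};

With something like that wired up, the handler above still resets the queue on every timeout, but only sets -ETIME and force-completes the fences when the ring actually hung.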



> Suggested-by: Alex Deucher <alexander.deucher@xxxxxxx>
> Signed-off-by: Jesse Zhang <jesse.zhang@xxxxxxx>
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_job.c | 10 ++++++++++
>  1 file changed, 10 insertions(+)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
> index 100f04475943..f94c876db72b 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
> @@ -101,6 +101,16 @@ static enum drm_gpu_sched_stat amdgpu_job_timedout(struct drm_sched_job *s_job)
>               /* Effectively the job is aborted as the device is gone */
>               return DRM_GPU_SCHED_STAT_ENODEV;
>       }
> +     /* Check if the ring is actually guilty of causing the timeout.
> +     * If not, skip error handling and fence completion.
> +     */
> +     if (amdgpu_gpu_recovery && ring->funcs->is_guilty) {
> +             if (!ring->funcs->is_guilty(ring)) {
> +                     dev_err(adev->dev, "ring %s timeout, but not guilty\n",
> +                             s_job->sched->name);
> +                     goto exit;
> +             }
> +     }
>
>       /*
>        * Do the coredump immediately after a job timeout to get a very
> --
> 2.25.1




