[Public]

-----Original Message-----
From: Deucher, Alexander <Alexander.Deucher@xxxxxxx>
Sent: Thursday, February 20, 2025 10:22 PM
To: Zhang, Jesse(Jie) <Jesse.Zhang@xxxxxxx>; amd-gfx@xxxxxxxxxxxxxxxxxxxxx
Cc: Kuehling, Felix <Felix.Kuehling@xxxxxxx>; Kim, Jonathan <Jonathan.Kim@xxxxxxx>; Zhu, Jiadong <Jiadong.Zhu@xxxxxxx>; Zhang, Jesse(Jie) <Jesse.Zhang@xxxxxxx>; Zhang, Jesse(Jie) <Jesse.Zhang@xxxxxxx>
Subject: RE: [PATCH V7 5/9] drm/amdgpu: Update amdgpu_job_timedout to check if the ring is guilty

[Public]

> -----Original Message-----
> From: Deucher, Alexander
> Sent: Wednesday, February 19, 2025 6:27 PM
> To: jesse.zhang@xxxxxxx; amd-gfx@xxxxxxxxxxxxxxxxxxxxx
> Cc: Kuehling, Felix <Felix.Kuehling@xxxxxxx>; Kim, Jonathan <Jonathan.Kim@xxxxxxx>; Zhu, Jiadong <Jiadong.Zhu@xxxxxxx>; Zhang, Jesse(Jie) <Jesse.Zhang@xxxxxxx>; Zhang, Jesse(Jie) <Jesse.Zhang@xxxxxxx>
> Subject: RE: [PATCH V7 5/9] drm/amdgpu: Update amdgpu_job_timedout to check if the ring is guilty
>
> > -----Original Message-----
> > From: jesse.zhang@xxxxxxx <jesse.zhang@xxxxxxx>
> > Sent: Thursday, February 13, 2025 12:47 AM
> > To: amd-gfx@xxxxxxxxxxxxxxxxxxxxx
> > Cc: Deucher, Alexander <Alexander.Deucher@xxxxxxx>; Kuehling, Felix <Felix.Kuehling@xxxxxxx>; Kim, Jonathan <Jonathan.Kim@xxxxxxx>; Zhu, Jiadong <Jiadong.Zhu@xxxxxxx>; Zhang, Jesse(Jie) <Jesse.Zhang@xxxxxxx>; Deucher, Alexander <Alexander.Deucher@xxxxxxx>; Zhang, Jesse(Jie) <Jesse.Zhang@xxxxxxx>
> > Subject: [PATCH V7 5/9] drm/amdgpu: Update amdgpu_job_timedout to check if the ring is guilty
> >
> > From: "Jesse.zhang@xxxxxxx" <Jesse.zhang@xxxxxxx>
> >
> > This patch updates the `amdgpu_job_timedout` function to check whether
> > the ring is actually guilty of causing the timeout. If not, it skips
> > error handling and fence completion.
> >
>
> Thinking about this more, I'm not sure this is the right approach.
> If we detect a timeout on a kernel ring, we still want to do the
> reset, but we don't want to kill the job if it's not guilty. This
> approach makes sense if all the rings are kernel rings, as we'll
> eventually get the timeout on the other ring and the reset will
> eventually be triggered. But if the hang is on a KFD queue, it won't
> get noticed until we attempt to preempt the user queues for some other
> reason, which may take a while. How about the following instead: we
> move the is_guilty check down into the queue reset area.
> Something like this:
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
> index 100f044759435..48350d1030612 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
> @@ -130,8 +130,6 @@ static enum drm_gpu_sched_stat amdgpu_job_timedout(struct drm_sched_job *s_job)
>                 amdgpu_vm_put_task_info(ti);
>         }
>
> -       dma_fence_set_error(&s_job->s_fence->finished, -ETIME);
> -
>         /* attempt a per ring reset */
>         if (amdgpu_gpu_recovery &&
>             ring->funcs->reset) {
> @@ -146,13 +144,22 @@ static enum drm_gpu_sched_stat amdgpu_job_timedout(struct drm_sched_job *s_job)
>                         if (amdgpu_ring_sched_ready(ring))
>                                 drm_sched_stop(&ring->sched, s_job);
>                         atomic_inc(&ring->adev->gpu_reset_counter);
> -                       amdgpu_fence_driver_force_completion(ring);
> +                       if (ring->funcs->is_guilty) {
> +                               if (ring->funcs->is_guilty(ring)) {
> +                                       dma_fence_set_error(&s_job->s_fence->finished, -ETIME);
> +                                       amdgpu_fence_driver_force_completion(ring);
> +                               }
> +                       } else {
> +                               amdgpu_fence_driver_force_completion(ring);
> +                               dma_fence_set_error(&s_job->s_fence->finished, -ETIME);
> +                       }
>                         if (amdgpu_ring_sched_ready(ring))
>                                 drm_sched_start(&ring->sched, 0);
>                         goto exit;
>                 }
>                 dev_err(adev->dev, "Ring %s reset failure\n", ring->sched.name);
>         }
> +       dma_fence_set_error(&s_job->s_fence->finished, -ETIME);
>
>         if (amdgpu_device_should_recover_gpu(ring->adev)) {
>                 struct amdgpu_reset_context reset_context;
>

Actually something like the attached patch.  Need to call is_guilty before reset.

Thanks Alex, will update before pushing to drm-next.

Jesse

Alex

> >
> > Suggested-by: Alex Deucher <alexander.deucher@xxxxxxx>
> > Signed-off-by: Jesse Zhang <jesse.zhang@xxxxxxx>
> > ---
> >  drivers/gpu/drm/amd/amdgpu/amdgpu_job.c | 10 ++++++++++
> >  1 file changed, 10 insertions(+)
> >
> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
> > index 100f04475943..f94c876db72b 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
> > @@ -101,6 +101,16 @@ static enum drm_gpu_sched_stat amdgpu_job_timedout(struct drm_sched_job *s_job)
> >                 /* Effectively the job is aborted as the device is gone */
> >                 return DRM_GPU_SCHED_STAT_ENODEV;
> >         }
> > +       /* Check if the ring is actually guilty of causing the timeout.
> > +        * If not, skip error handling and fence completion.
> > +        */
> > +       if (amdgpu_gpu_recovery && ring->funcs->is_guilty) {
> > +               if (!ring->funcs->is_guilty(ring)) {
> > +                       dev_err(adev->dev, "ring %s timeout, but not guilty\n",
> > +                               s_job->sched->name);
> > +                       goto exit;
> > +               }
> > +       }
> >
> >         /*
> >          * Do the coredump immediately after a job timeout to get a very
> > --
> > 2.25.1
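
For reference, here is a rough sketch of how the per-ring reset branch of amdgpu_job_timedout() could look with the feedback above folded in, i.e. is_guilty queried before the reset, and the fence only poisoned and force-completed when the ring really is guilty.  This is only an illustration built from the helpers visible in the hunks above; it assumes the function's existing locals (adev, ring, job, s_job, r) and the ring->funcs->reset(ring, vmid) callback shape from this series, and the final committed code may differ:

        /* attempt a per ring reset before falling back to full GPU reset */
        if (amdgpu_gpu_recovery && ring->funcs->reset) {
                bool is_guilty;

                dev_err(adev->dev, "Starting %s ring reset\n",
                        s_job->sched->name);

                /* Query guilt before the reset disturbs the ring state.
                 * Rings without the callback keep the old behaviour and
                 * are treated as guilty.
                 */
                is_guilty = ring->funcs->is_guilty ?
                            ring->funcs->is_guilty(ring) : true;

                r = ring->funcs->reset(ring, job->vmid);
                if (!r) {
                        if (amdgpu_ring_sched_ready(ring))
                                drm_sched_stop(&ring->sched, s_job);
                        atomic_inc(&ring->adev->gpu_reset_counter);
                        /* Only kill the job and force-complete its fence
                         * if this ring actually caused the hang.
                         */
                        if (is_guilty) {
                                dma_fence_set_error(&s_job->s_fence->finished,
                                                    -ETIME);
                                amdgpu_fence_driver_force_completion(ring);
                        }
                        if (amdgpu_ring_sched_ready(ring))
                                drm_sched_start(&ring->sched, 0);
                        goto exit;
                }
                dev_err(adev->dev, "Ring %s reset failure\n", ring->sched.name);
        }

        /* Per-ring reset was not possible or failed: fall back to the full
         * GPU reset path, where the job is always treated as failed.
         */
        dma_fence_set_error(&s_job->s_fence->finished, -ETIME);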