[Public]

-----Original Message-----
From: Deucher, Alexander <Alexander.Deucher@xxxxxxx>
Sent: Thursday, February 20, 2025 10:22 PM
To: Zhang, Jesse(Jie) <Jesse.Zhang@xxxxxxx>; amd-gfx@xxxxxxxxxxxxxxxxxxxxx
Cc: Kuehling, Felix <Felix.Kuehling@xxxxxxx>; Kim, Jonathan <Jonathan.Kim@xxxxxxx>; Zhu, Jiadong <Jiadong.Zhu@xxxxxxx>; Zhang, Jesse(Jie) <Jesse.Zhang@xxxxxxx>; Zhang, Jesse(Jie) <Jesse.Zhang@xxxxxxx>
Subject: RE: [PATCH V7 5/9] drm/amdgpu: Update amdgpu_job_timedout to check if the ring is guilty

[Public]

> -----Original Message-----
> From: Deucher, Alexander
> Sent: Wednesday, February 19, 2025 6:27 PM
> To: jesse.zhang@xxxxxxx; amd-gfx@xxxxxxxxxxxxxxxxxxxxx
> Cc: Kuehling, Felix <Felix.Kuehling@xxxxxxx>; Kim, Jonathan <Jonathan.Kim@xxxxxxx>; Zhu, Jiadong <Jiadong.Zhu@xxxxxxx>; Zhang, Jesse(Jie) <Jesse.Zhang@xxxxxxx>; Zhang, Jesse(Jie) <Jesse.Zhang@xxxxxxx>
> Subject: RE: [PATCH V7 5/9] drm/amdgpu: Update amdgpu_job_timedout to check if the ring is guilty
>
> > -----Original Message-----
> > From: jesse.zhang@xxxxxxx <jesse.zhang@xxxxxxx>
> > Sent: Thursday, February 13, 2025 12:47 AM
> > To: amd-gfx@xxxxxxxxxxxxxxxxxxxxx
> > Cc: Deucher, Alexander <Alexander.Deucher@xxxxxxx>; Kuehling, Felix <Felix.Kuehling@xxxxxxx>; Kim, Jonathan <Jonathan.Kim@xxxxxxx>; Zhu, Jiadong <Jiadong.Zhu@xxxxxxx>; Zhang, Jesse(Jie) <Jesse.Zhang@xxxxxxx>; Deucher, Alexander <Alexander.Deucher@xxxxxxx>; Zhang, Jesse(Jie) <Jesse.Zhang@xxxxxxx>
> > Subject: [PATCH V7 5/9] drm/amdgpu: Update amdgpu_job_timedout to check if the ring is guilty
> >
> > From: "Jesse.zhang@xxxxxxx" <Jesse.zhang@xxxxxxx>
> >
> > This patch updates the `amdgpu_job_timedout` function to check whether
> > the ring is actually guilty of causing the timeout. If not, it skips
> > error handling and fence completion.
> >
>
> Thinking about this more, I'm not sure this is the right approach.
> If we detect a timeout on a kernel ring, we still want to do the
> reset, but we don't want to kill the job if it's not guilty. This
> approach makes sense if all the rings are kernel rings, as we'll
> eventually get the timeout on the other ring and the reset will
> eventually be triggered. But if the hang is on a KFD queue, it won't
> get noticed until we attempt to preempt the user queues for some other
> reason, which may take a while. How about the following instead: we
> move the is_guilty check down into the queue reset area.
> Something like this:
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
> index 100f044759435..48350d1030612 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
> @@ -130,8 +130,6 @@ static enum drm_gpu_sched_stat amdgpu_job_timedout(struct drm_sched_job *s_job)
>                 amdgpu_vm_put_task_info(ti);
>         }
>
> -       dma_fence_set_error(&s_job->s_fence->finished, -ETIME);
> -
>         /* attempt a per ring reset */
>         if (amdgpu_gpu_recovery &&
>             ring->funcs->reset) {
> @@ -146,13 +144,22 @@ static enum drm_gpu_sched_stat amdgpu_job_timedout(struct drm_sched_job *s_job)
>                         if (amdgpu_ring_sched_ready(ring))
>                                 drm_sched_stop(&ring->sched, s_job);
>                         atomic_inc(&ring->adev->gpu_reset_counter);
> -                       amdgpu_fence_driver_force_completion(ring);
> +                       if (ring->funcs->is_guilty) {
> +                               if (ring->funcs->is_guilty(ring)) {
> +                                       dma_fence_set_error(&s_job->s_fence->finished, -ETIME);
> +                                       amdgpu_fence_driver_force_completion(ring);
> +                               }
> +                       } else {
> +                               amdgpu_fence_driver_force_completion(ring);
> +                               dma_fence_set_error(&s_job->s_fence->finished, -ETIME);
> +                       }
>                         if (amdgpu_ring_sched_ready(ring))
>                                 drm_sched_start(&ring->sched, 0);
>                         goto exit;
>                 }
>                 dev_err(adev->dev, "Ring %s reset failure\n", ring->sched.name);
>         }
> +       dma_fence_set_error(&s_job->s_fence->finished, -ETIME);
>
>         if (amdgpu_device_should_recover_gpu(ring->adev)) {
>                 struct amdgpu_reset_context reset_context;
>

Actually something like the attached patch.  Need to call is_guilty before reset.

Thanks Alex, will update before pushing to drm-next.

Jesse

Alex

> >
> > Suggested-by: Alex Deucher <alexander.deucher@xxxxxxx>
> > Signed-off-by: Jesse Zhang <jesse.zhang@xxxxxxx>
> > ---
> >  drivers/gpu/drm/amd/amdgpu/amdgpu_job.c | 10 ++++++++++
> >  1 file changed, 10 insertions(+)
> >
> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
> > index 100f04475943..f94c876db72b 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
> > @@ -101,6 +101,16 @@ static enum drm_gpu_sched_stat amdgpu_job_timedout(struct drm_sched_job *s_job)
> >                 /* Effectively the job is aborted as the device is gone */
> >                 return DRM_GPU_SCHED_STAT_ENODEV;
> >         }
> > +       /* Check if the ring is actually guilty of causing the timeout.
> > +        * If not, skip error handling and fence completion.
> > +        */
> > +       if (amdgpu_gpu_recovery && ring->funcs->is_guilty) {
> > +               if (!ring->funcs->is_guilty(ring)) {
> > +                       dev_err(adev->dev, "ring %s timeout, but not guilty\n",
> > +                               s_job->sched->name);
> > +                       goto exit;
> > +               }
> > +       }
> >
> >         /*
> >          * Do the coredump immediately after a job timeout to get a very
> > --
> > 2.25.1
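
For reference, here is a rough sketch of how the per-ring reset branch of amdgpu_job_timedout() could look with the feedback above folded in, i.e. is_guilty queried before the reset, and the fence only poisoned and force-completed when the ring really is guilty.  This is only an illustration built from the helpers visible in the hunks above; it assumes the function's existing locals (adev, ring, job, s_job, r) and the ring->funcs->reset(ring, vmid) callback shape from this series, and the final committed code may differ:

        /* attempt a per ring reset before falling back to full GPU reset */
        if (amdgpu_gpu_recovery && ring->funcs->reset) {
                bool is_guilty;

                dev_err(adev->dev, "Starting %s ring reset\n",
                        s_job->sched->name);

                /* Query guilt before the reset disturbs the ring state.
                 * Rings without the callback keep the old behaviour and
                 * are treated as guilty.
                 */
                is_guilty = ring->funcs->is_guilty ?
                            ring->funcs->is_guilty(ring) : true;

                r = ring->funcs->reset(ring, job->vmid);
                if (!r) {
                        if (amdgpu_ring_sched_ready(ring))
                                drm_sched_stop(&ring->sched, s_job);
                        atomic_inc(&ring->adev->gpu_reset_counter);
                        /* Only kill the job and force-complete its fence
                         * if this ring actually caused the hang.
                         */
                        if (is_guilty) {
                                dma_fence_set_error(&s_job->s_fence->finished,
                                                    -ETIME);
                                amdgpu_fence_driver_force_completion(ring);
                        }
                        if (amdgpu_ring_sched_ready(ring))
                                drm_sched_start(&ring->sched, 0);
                        goto exit;
                }
                dev_err(adev->dev, "Ring %s reset failure\n", ring->sched.name);
        }

        /* Per-ring reset was not possible or failed: fall back to the full
         * GPU reset path, where the job is always treated as failed.
         */
        dma_fence_set_error(&s_job->s_fence->finished, -ETIME);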