[AMD Official Use Only - AMD Internal Distribution Only]

-----Original Message-----
From: Koenig, Christian <Christian.Koenig@xxxxxxx>
Sent: Wednesday, January 15, 2025 7:56 PM
To: Zhang, Jesse(Jie) <Jesse.Zhang@xxxxxxx>; amd-gfx@xxxxxxxxxxxxxxxxxxxxx
Cc: Deucher, Alexander <Alexander.Deucher@xxxxxxx>; Huang, Tim <Tim.Huang@xxxxxxx>; Prosyak, Vitaly <Vitaly.Prosyak@xxxxxxx>
Subject: Re: [PATCH] drm/amdgpu: Use -ENODATA for GPU job timeout queue recovery

On 15.01.25 at 07:52, Jesse.zhang@xxxxxxx wrote:
> When a GPU job times out, the driver attempts to recover by restarting
> the scheduler. Previously, the scheduler was restarted with an error
> code of 0, which does not distinguish between a full GPU reset and a
> queue reset. This patch changes the error code to -ENODATA for queue
> resets, while -ECANCELED or -ETIME is used for full GPU resets.
>
> This change improves error handling by:
> 1. Clearly differentiating between queue resets and full GPU resets.
> 2. Providing more specific error codes for better debugging and recovery.
> 3. Aligning with kernel best practices for error reporting.
>
> The related commit "b2ef808786d93df3658" (drm/sched: add optional
> errno to drm_sched_start()) introduced support for passing an error
> code to drm_sched_start(), enabling this improvement.

I'm about to remove the scheduler stop/start for queue resets, which would make this patch superfluous.

On the other hand, I'm not sure when I will be done with that work. So it could be that this will take a while and we should commit this in the meantime.

Thanks Christian, I'll hold this patch until you finish it.

Thanks
Jesse

Regards,
Christian.
>
> Signed-off-by: Vitaly Prosyak <vitaly.prosyak@xxxxxxx>
> Signed-off-by: Jesse Zhang <jesse.zhang@xxxxxxx>
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_job.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
> index 100f04475943..b18b316872a0 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
> @@ -148,7 +148,7 @@ static enum drm_gpu_sched_stat amdgpu_job_timedout(struct drm_sched_job *s_job)
>  			atomic_inc(&ring->adev->gpu_reset_counter);
>  			amdgpu_fence_driver_force_completion(ring);
>  			if (amdgpu_ring_sched_ready(ring))
> -				drm_sched_start(&ring->sched, 0);
> +				drm_sched_start(&ring->sched, -ENODATA);
>  			goto exit;
>  		}
>  		dev_err(adev->dev, "Ring %s reset failure\n", ring->sched.name);
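
For illustration only, the error-code scheme from the commit message can be modelled in user space roughly as below. This is a sketch under stated assumptions, not driver code: struct mock_fence, mock_sched_start() and mock_classify() are hypothetical names invented here, and the rule that an already recorded error is left untouched is a simplification of this toy model, not a statement about how drm_sched_start() behaves. Only the mapping itself (-ENODATA for queue resets, -ECANCELED or -ETIME for full GPU resets, 0 for no error) is taken from the patch description.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in for a pending job's fence; NOT the kernel's struct dma_fence. */
struct mock_fence {
	int error;	/* 0 = nothing recorded, otherwise a negative errno */
};

/* Hypothetical classification of what a recorded error says about the recovery path. */
enum mock_reset_kind {
	MOCK_NO_RESET,
	MOCK_QUEUE_RESET,
	MOCK_FULL_GPU_RESET,
	MOCK_UNKNOWN,
};

/*
 * Toy model of restarting a scheduler with an errno: every pending fence
 * that has no error yet is tagged with err. Assumption of this sketch:
 * an error that is already recorded is not overwritten.
 */
static void mock_sched_start(struct mock_fence *pending, size_t n, int err)
{
	assert(err <= 0);	/* errno values are passed as 0 or negative */

	for (size_t i = 0; i < n; i++) {
		if (err && !pending[i].error)
			pending[i].error = err;
	}
}

/* Map the recorded error to the reset kind described in the commit message. */
static enum mock_reset_kind mock_classify(const struct mock_fence *f)
{
	switch (f->error) {
	case 0:
		return MOCK_NO_RESET;
	case -ENODATA:
		return MOCK_QUEUE_RESET;
	case -ECANCELED:
	case -ETIME:
		return MOCK_FULL_GPU_RESET;
	default:
		return MOCK_UNKNOWN;
	}
}
```

With an error code of 0, as before the patch, a consumer of the fence could not tell a queue reset apart from normal completion; with -ENODATA the two cases fall into different branches of mock_classify().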