RE: [PATCH] drm/amdgpu: Use -ENODATA for GPU job timeout queue recovery

"Zhang, Jesse(Jie)" <Jesse.Zhang@xxxxxxx> · Thu, 16 Jan 2025 09:20:15 +0000

[AMD Official Use Only - AMD Internal Distribution Only]

-----Original Message-----
From: Koenig, Christian <Christian.Koenig@xxxxxxx>
Sent: Wednesday, January 15, 2025 7:56 PM
To: Zhang, Jesse(Jie) <Jesse.Zhang@xxxxxxx>; amd-gfx@xxxxxxxxxxxxxxxxxxxxx
Cc: Deucher, Alexander <Alexander.Deucher@xxxxxxx>; Huang, Tim <Tim.Huang@xxxxxxx>; Prosyak, Vitaly <Vitaly.Prosyak@xxxxxxx>
Subject: Re: [PATCH] drm/amdgpu: Use -ENODATA for GPU job timeout queue recovery

On 15.01.25 at 07:52, Jesse.zhang@xxxxxxx wrote:
> When a GPU job times out, the driver attempts to recover by restarting
> the scheduler. Previously, the scheduler was restarted with an error
> code of 0, which does not distinguish between a full GPU reset and a
> queue reset. This patch changes the error code to -ENODATA for queue
> resets, while -ECANCELED or -ETIME is used for full GPU resets.
>
> This change improves error handling by:
> 1. Clearly differentiating between queue resets and full GPU resets.
> 2. Providing more specific error codes for better debugging and recovery.
> 3. Aligning with kernel best practices for error reporting.
>
> The related commit "b2ef808786d93df3658" (drm/sched: add optional
> errno to drm_sched_start()) introduced support for passing an error
> code to drm_sched_start(), enabling this improvement.

I'm about to remove the scheduler stop/start for queue resets, which would make this patch superfluous.

On the other hand, I'm not sure when I will be done with that work. It could take a while, so we should perhaps commit this in the meantime.

Thanks, Christian. I'll hold this patch until you finish that work.

Thanks
Jesse

Regards,
Christian.

>
> Signed-off-by: Vitaly Prosyak <vitaly.prosyak@xxxxxxx>
> Signed-off-by: Jesse Zhang <jesse.zhang@xxxxxxx>
> ---
>   drivers/gpu/drm/amd/amdgpu/amdgpu_job.c | 2 +-
>   1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
> index 100f04475943..b18b316872a0 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
> @@ -148,7 +148,7 @@ static enum drm_gpu_sched_stat amdgpu_job_timedout(struct drm_sched_job *s_job)
>                       atomic_inc(&ring->adev->gpu_reset_counter);
>                       amdgpu_fence_driver_force_completion(ring);
>                       if (amdgpu_ring_sched_ready(ring))
> -                             drm_sched_start(&ring->sched, 0);
> +                             drm_sched_start(&ring->sched, -ENODATA);
>                       goto exit;
>               }
>               dev_err(adev->dev, "Ring %s reset failure\n", ring->sched.name);
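
For illustration only, here is a minimal userspace sketch of the distinction the commit message describes: 0 for no error, -ENODATA for a queue reset, and -ECANCELED or -ETIME for a full GPU reset. It assumes the value passed to drm_sched_start() ends up as the error a consumer later reads back for the affected jobs. The enum and the function classify_reset() are hypothetical names made up for this sketch; they are not part of amdgpu or drm/sched.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical classification; not an amdgpu or drm/sched type. */
enum reset_kind {
	RESET_NONE,     /* 0: no error attached */
	RESET_QUEUE,    /* -ENODATA: queue (ring) reset, as proposed in this patch */
	RESET_FULL_GPU, /* -ECANCELED or -ETIME: full GPU reset */
	RESET_UNKNOWN   /* any other negative errno */
};

/*
 * Map a kernel-style error value (0 or a negative errno) to the kind of
 * reset it would indicate under the scheme in the commit message.
 */
enum reset_kind classify_reset(int error)
{
	/* Kernel-style error values are zero or negative. */
	assert(error <= 0);

	switch (error) {
	case 0:
		return RESET_NONE;
	case -ENODATA:
		return RESET_QUEUE;
	case -ECANCELED:
	case -ETIME:
		return RESET_FULL_GPU;
	default:
		return RESET_UNKNOWN;
	}
}
```

With the old behaviour (an error code of 0) a queue reset would be classified as RESET_NONE here, which is the ambiguity the patch is meant to remove.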