Re: [PATCH] drm/amdgpu: Use -ENODATA for GPU job timeout queue recovery

On 15.01.25 at 07:52, Jesse.zhang@xxxxxxx wrote:
When a GPU job times out, the driver attempts to recover by restarting
the scheduler. Previously, the scheduler was restarted with an error
code of 0, which does not distinguish between a full GPU reset and a
queue reset. This patch changes the error code to -ENODATA for queue
resets, while -ECANCELED or -ETIME is used for full GPU resets.

This change improves error handling by:
1. Clearly differentiating between queue resets and full GPU resets.
2. Providing more specific error codes for better debugging and recovery.
3. Aligning with kernel best practices for error reporting.

The related commit "b2ef808786d93df3658" (drm/sched: add optional errno
to drm_sched_start()) introduced support for passing an error code to
drm_sched_start(), enabling this improvement.

I'm about to remove the scheduler stop/start for queue resets, which would make this patch superfluous.

On the other hand, I'm not sure when I will be done with that work. So it could be that this takes a while, and we should commit this patch in the meantime.

Regards,
Christian.


Signed-off-by: Vitaly Prosyak <vitaly.prosyak@xxxxxxx>
Signed-off-by: Jesse Zhang <jesse.zhang@xxxxxxx>
---
  drivers/gpu/drm/amd/amdgpu/amdgpu_job.c | 2 +-
  1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
index 100f04475943..b18b316872a0 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
@@ -148,7 +148,7 @@ static enum drm_gpu_sched_stat amdgpu_job_timedout(struct drm_sched_job *s_job)
  			atomic_inc(&ring->adev->gpu_reset_counter);
  			amdgpu_fence_driver_force_completion(ring);
  			if (amdgpu_ring_sched_ready(ring))
-				drm_sched_start(&ring->sched, 0);
+				drm_sched_start(&ring->sched, -ENODATA);
  			goto exit;
  		}
  		dev_err(adev->dev, "Ring %s reset failure\n", ring->sched.name);
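
For illustration only, the errno convention described in the commit message above (0 for no error, -ENODATA for a queue reset, -ECANCELED or -ETIME for a full GPU reset) can be modelled in a small standalone C sketch. This is not kernel code and not part of the patch: the enum reset_kind type and the classify_reset_errno() and reset_kind_name() helpers are hypothetical names invented here, and the sketch assumes the Linux userspace <errno.h> values.

```c
#include <errno.h>

/* Hypothetical categories for the errno handed to drm_sched_start(). */
enum reset_kind {
	RESET_NONE,      /* 0: no error attached to the restart */
	RESET_QUEUE,     /* -ENODATA: per-queue reset, per the commit message */
	RESET_FULL_GPU,  /* -ECANCELED or -ETIME: full GPU reset */
	RESET_UNKNOWN,   /* any other value */
};

/* Map an errno value to the kind of reset it is assumed to signal. */
static enum reset_kind classify_reset_errno(int err)
{
	switch (err) {
	case 0:
		return RESET_NONE;
	case -ENODATA:
		return RESET_QUEUE;
	case -ECANCELED:
	case -ETIME:
		return RESET_FULL_GPU;
	default:
		return RESET_UNKNOWN;
	}
}

/* Human-readable label, e.g. for a debug message. */
static const char *reset_kind_name(enum reset_kind kind)
{
	switch (kind) {
	case RESET_NONE:
		return "no reset";
	case RESET_QUEUE:
		return "queue reset";
	case RESET_FULL_GPU:
		return "full GPU reset";
	default:
		return "unknown";
	}
}
```

With the previous value of 0, a consumer of the error code could not tell a queue reset apart from a restart without any error, which is the ambiguity the one-line change addresses.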
