Re: [PATCH 3/6] drm/scheduler: Job timeout handler returns status

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



Am 25.11.20 um 04:17 schrieb Luben Tuikov:
The job timeout handler now returns status
indicating back to the DRM layer whether the job
was successfully cancelled or whether more time
should be given to the job to complete.

Signed-off-by: Luben Tuikov <luben.tuikov@xxxxxxx>
---
  drivers/gpu/drm/amd/amdgpu/amdgpu_job.c |  6 ++++--
  include/drm/gpu_scheduler.h             | 13 ++++++++++---
  2 files changed, 14 insertions(+), 5 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
index ff48101bab55..81b73790ecc6 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
@@ -28,7 +28,7 @@
  #include "amdgpu.h"
  #include "amdgpu_trace.h"
-static void amdgpu_job_timedout(struct drm_sched_job *s_job)
+static int amdgpu_job_timedout(struct drm_sched_job *s_job)
  {
  	struct amdgpu_ring *ring = to_amdgpu_ring(s_job->sched);
  	struct amdgpu_job *job = to_amdgpu_job(s_job);
@@ -41,7 +41,7 @@ static void amdgpu_job_timedout(struct drm_sched_job *s_job)
  	    amdgpu_ring_soft_recovery(ring, job->vmid, s_job->s_fence->parent)) {
  		DRM_ERROR("ring %s timeout, but soft recovered\n",
  			  s_job->sched->name);
-		return;
+		return 0;
  	}
amdgpu_vm_get_task_info(ring->adev, job->pasid, &ti);
@@ -53,10 +53,12 @@ static void amdgpu_job_timedout(struct drm_sched_job *s_job)
if (amdgpu_device_should_recover_gpu(ring->adev)) {
  		amdgpu_device_gpu_recover(ring->adev, job);
+		return 0;
  	} else {
  		drm_sched_suspend_timeout(&ring->sched);
  		if (amdgpu_sriov_vf(adev))
  			adev->virt.tdr_debug = true;
+		return 1;
  	}
  }
diff --git a/include/drm/gpu_scheduler.h b/include/drm/gpu_scheduler.h
index 2e0c368e19f6..61f7121e1c19 100644
--- a/include/drm/gpu_scheduler.h
+++ b/include/drm/gpu_scheduler.h
@@ -230,10 +230,17 @@ struct drm_sched_backend_ops {
  	struct dma_fence *(*run_job)(struct drm_sched_job *sched_job);
/**
-         * @timedout_job: Called when a job has taken too long to execute,
-         * to trigger GPU recovery.
+	 * @timedout_job: Called when a job has taken too long to execute,
+	 * to trigger GPU recovery.
+	 *
+	 * Return 0, if the job has been aborted successfully and will
+	 * never be heard of from the device. Return non-zero if the
+	 * job wasn't able to be aborted, i.e. if more time should be
+	 * given to this job. The result is not "bool" as this
+	 * function is not a predicate, although its result may seem
+	 * as one.

I think the whole approach of timing out a job needs to be rethinked. What's timing out here is the hardware engine, not the job.

So we should also not have the job as parameter here. Maybe we should make that the fence we are waiting for instead.

  	 */
-	void (*timedout_job)(struct drm_sched_job *sched_job);
+	int (*timedout_job)(struct drm_sched_job *sched_job);

I would either return an error code, boolean or enum here. But not use a number without a define.

Regards,
Christian.

/**
           * @free_job: Called once the job's finished fence has been signaled

_______________________________________________
amd-gfx mailing list
amd-gfx@xxxxxxxxxxxxxxxxxxxxx
https://lists.freedesktop.org/mailman/listinfo/amd-gfx



[Index of Archives]     [Linux USB Devel]     [Linux Audio Users]     [Yosemite News]     [Linux Kernel]     [Linux SCSI]

  Powered by Linux