On 2020-11-25 04:50, Christian König wrote: > Am 25.11.20 um 04:17 schrieb Luben Tuikov: >> The job timeout handler now returns status >> indicating back to the DRM layer whether the job >> was successfully cancelled or whether more time >> should be given to the job to complete. >> >> Signed-off-by: Luben Tuikov <luben.tuikov@xxxxxxx> >> --- >> drivers/gpu/drm/amd/amdgpu/amdgpu_job.c | 6 ++++-- >> include/drm/gpu_scheduler.h | 13 ++++++++++--- >> 2 files changed, 14 insertions(+), 5 deletions(-) >> >> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c >> index ff48101bab55..81b73790ecc6 100644 >> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c >> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c >> @@ -28,7 +28,7 @@ >> #include "amdgpu.h" >> #include "amdgpu_trace.h" >> >> -static void amdgpu_job_timedout(struct drm_sched_job *s_job) >> +static int amdgpu_job_timedout(struct drm_sched_job *s_job) >> { >> struct amdgpu_ring *ring = to_amdgpu_ring(s_job->sched); >> struct amdgpu_job *job = to_amdgpu_job(s_job); >> @@ -41,7 +41,7 @@ static void amdgpu_job_timedout(struct drm_sched_job *s_job) >> amdgpu_ring_soft_recovery(ring, job->vmid, s_job->s_fence->parent)) { >> DRM_ERROR("ring %s timeout, but soft recovered\n", >> s_job->sched->name); >> - return; >> + return 0; >> } >> >> amdgpu_vm_get_task_info(ring->adev, job->pasid, &ti); >> @@ -53,10 +53,12 @@ static void amdgpu_job_timedout(struct drm_sched_job *s_job) >> >> if (amdgpu_device_should_recover_gpu(ring->adev)) { >> amdgpu_device_gpu_recover(ring->adev, job); >> + return 0; >> } else { >> drm_sched_suspend_timeout(&ring->sched); >> if (amdgpu_sriov_vf(adev)) >> adev->virt.tdr_debug = true; >> + return 1; >> } >> } >> >> diff --git a/include/drm/gpu_scheduler.h b/include/drm/gpu_scheduler.h >> index 2e0c368e19f6..61f7121e1c19 100644 >> --- a/include/drm/gpu_scheduler.h >> +++ b/include/drm/gpu_scheduler.h >> @@ -230,10 +230,17 @@ struct drm_sched_backend_ops { >> struct dma_fence *(*run_job)(struct drm_sched_job *sched_job); >> >> /** >> - * @timedout_job: Called when a job has taken too long to execute, >> - * to trigger GPU recovery. >> + * @timedout_job: Called when a job has taken too long to execute, >> + * to trigger GPU recovery. >> + * >> + * Return 0, if the job has been aborted successfully and will >> + * never be heard of from the device. Return non-zero if the >> + * job wasn't able to be aborted, i.e. if more time should be >> + * given to this job. The result is not "bool" as this >> + * function is not a predicate, although its result may seem >> + * as one. > > I think the whole approach of timing out a job needs to be rethinked. > What's timing out here is the hardware engine, not the job. > > So we should also not have the job as parameter here. Maybe we should > make that the fence we are waiting for instead. Yes, I wanted this patch to be minimal, and not to disrupt too many things. Yes, in the future we can totally revamp this, but this is a minimal patch. > >> */ >> - void (*timedout_job)(struct drm_sched_job *sched_job); >> + int (*timedout_job)(struct drm_sched_job *sched_job); > > I would either return an error code, boolean or enum here. But not use a > number without a define. Yes, that's a great idea--I'll make the change now, and resubmit. Regards, Luben > > Regards, > Christian. > >> >> /** >> * @free_job: Called once the job's finished fence has been signaled > _______________________________________________ amd-gfx mailing list amd-gfx@xxxxxxxxxxxxxxxxxxxxx https://lists.freedesktop.org/mailman/listinfo/amd-gfx