drm_sched_job_finish() is a work item scheduled for each finished job on a unbound system workqueue. This means the workers can execute out of order with regard to the real hardware job completions. If this happens queueing a timeout worker for the first job on the ring mirror list is wrong, as this may be a job which has already finished executing. Fix this by reorganizing the code to always queue the worker for the next job on the list, if this job hasn't finished yet. This is robust against a potential reordering of the finish workers. Also move out the timeout worker cancelling, so that we don't need to take the job list lock twice. As a small optimization list_del is used to remove the job from the ring mirror list, as there is no need to reinit the list head in the job we are about to free. Signed-off-by: Lucas Stach <l.stach@xxxxxxxxxxxxxx> Reviewed-by: Christian König <christian.koenig@xxxxxxx> --- v2: - properly handle last job in the ring - check correct fence for completion v3: - drop condition before cancel_delayed_work_sync - add a comment explaining how we avoid the timeout to be restarted without removing the job from the ring mirror first --- drivers/gpu/drm/scheduler/gpu_scheduler.c | 30 +++++++++++++---------- 1 file changed, 17 insertions(+), 13 deletions(-) diff --git a/drivers/gpu/drm/scheduler/gpu_scheduler.c b/drivers/gpu/drm/scheduler/gpu_scheduler.c index 44d480768dfe..fff3716ece6f 100644 --- a/drivers/gpu/drm/scheduler/gpu_scheduler.c +++ b/drivers/gpu/drm/scheduler/gpu_scheduler.c @@ -452,24 +452,28 @@ static void drm_sched_job_finish(struct work_struct *work) finish_work); struct drm_gpu_scheduler *sched = s_job->sched; - /* remove job from ring_mirror_list */ - spin_lock(&sched->job_list_lock); - list_del_init(&s_job->node); - if (sched->timeout != MAX_SCHEDULE_TIMEOUT) { - struct drm_sched_job *next; - - spin_unlock(&sched->job_list_lock); - cancel_delayed_work_sync(&s_job->work_tdr); - spin_lock(&sched->job_list_lock); + /* + * Canceling the timeout without removing our job from the ring mirror + * list is safe, as we will only end up in this worker if our jobs + * finished fence has been signaled. So even if some another worker + * manages to find this job as the next job in the list, the fence + * signaled check below will prevent the timeout to be restarted. + */ + cancel_delayed_work_sync(&s_job->work_tdr); - /* queue TDR for next job */ - next = list_first_entry_or_null(&sched->ring_mirror_list, - struct drm_sched_job, node); + spin_lock(&sched->job_list_lock); + /* queue TDR for next job */ + if (sched->timeout != MAX_SCHEDULE_TIMEOUT && + !list_is_last(&s_job->node, &sched->ring_mirror_list)) { + struct drm_sched_job *next = list_next_entry(s_job, node); - if (next) + if (!dma_fence_is_signaled(&next->s_fence->finished)) schedule_delayed_work(&next->work_tdr, sched->timeout); } + /* remove job from ring_mirror_list */ + list_del(&s_job->node); spin_unlock(&sched->job_list_lock); + dma_fence_put(&s_job->s_fence->finished); sched->ops->free_job(s_job); } -- 2.18.0 _______________________________________________ dri-devel mailing list dri-devel@xxxxxxxxxxxxxxxxxxxxx https://lists.freedesktop.org/mailman/listinfo/dri-devel