On Monday, 06.08.2018, 10:12 +0200, Christian König wrote:
> On 03.08.2018 at 19:31, Lucas Stach wrote:
> > On Monday, 06.08.2018, 14:57 +0200, Christian König wrote:
> > > On 03.08.2018 at 16:29, Lucas Stach wrote:
> > > > drm_sched_job_finish() is a work item scheduled for each finished job on
> > > > an unbound system workqueue. This means the workers can execute out of
> > > > order with regard to the real hardware job completions.
> > > > 
> > > > If this happens, queueing a timeout worker for the first job on the ring
> > > > mirror list is wrong, as this may be a job which has already finished
> > > > executing. Fix this by reorganizing the code to always queue the worker
> > > > for the next job on the list, if this job hasn't finished yet. This is
> > > > robust against a potential reordering of the finish workers.
> > > > 
> > > > Also move out the timeout worker cancelling, so that we don't need to
> > > > take the job list lock twice. As a small optimization, list_del is used
> > > > to remove the job from the ring mirror list, as there is no need to
> > > > reinit the list head in the job we are about to free.
> > > > 
> > > > Signed-off-by: Lucas Stach <l.stach@xxxxxxxxxxxxxx>
> > > > ---
> > > > v2: - properly handle last job in the ring
> > > >     - check correct fence for completion
> > > > ---
> > > >  drivers/gpu/drm/scheduler/gpu_scheduler.c | 22 ++++++++++------------
> > > >  1 file changed, 10 insertions(+), 12 deletions(-)
> > > > 
> > > > diff --git a/drivers/gpu/drm/scheduler/gpu_scheduler.c b/drivers/gpu/drm/scheduler/gpu_scheduler.c
> > > > index 44d480768dfe..574875e2c206 100644
> > > > --- a/drivers/gpu/drm/scheduler/gpu_scheduler.c
> > > > +++ b/drivers/gpu/drm/scheduler/gpu_scheduler.c
> > > > @@ -452,24 +452,22 @@ static void drm_sched_job_finish(struct work_struct *work)
> > > >  					   finish_work);
> > > >  	struct drm_gpu_scheduler *sched = s_job->sched;
> > > >  
> > > > -	/* remove job from ring_mirror_list */
> > > > -	spin_lock(&sched->job_list_lock);
> > > > -	list_del_init(&s_job->node);
> > > > -	if (sched->timeout != MAX_SCHEDULE_TIMEOUT) {
> > > > -		struct drm_sched_job *next;
> > > > -
> > > > -		spin_unlock(&sched->job_list_lock);
> > > > +	if (sched->timeout != MAX_SCHEDULE_TIMEOUT)
> > > >  		cancel_delayed_work_sync(&s_job->work_tdr);
> > > 
> > > That is unfortunately still racy here.
> > > 
> > > Between canceling the job and removing it from the list someone could
> > > actually start the timer (in theory) :)
> > > 
> > > Cancel it, remove it from the list and cancel it again.
> > 
> > I don't see how. If we end up in this worker, the finished fence of the
> > job is already certainly signaled (as this is what triggers queueing of
> > the worker). So even if some other worker manages to find this job as
> > the next job in the list, the dma_fence_is_signaled check should
> > prevent the timeout worker from getting scheduled again.
> 
> Well that makes sense, but is a bit hard to understand.

I agree, all the possible parallelism and re-ordering makes this
seemingly simple part of the scheduler code a bit mind-bending. I've
added a comment in v3 to capture the line of thought for future
reference.
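To sketch that line of thought right here (a paraphrase, not the actual
v3 comment text), the cancel path would read roughly:

	/*
	 * Canceling the timeout without holding job_list_lock is safe: we
	 * only get here once the finished fence of this job has signaled.
	 * So even if another finish worker concurrently finds this job as
	 * the next one on the ring mirror list, its dma_fence_is_signaled()
	 * check will see the signaled fence and not re-arm our timeout
	 * worker.
	 */
	if (sched->timeout != MAX_SCHEDULE_TIMEOUT)
		cancel_delayed_work_sync(&s_job->work_tdr);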
Regards,
Lucas

> Anyway, please remove the extra "if" check. With that done the patch is
> 
> Reviewed-by: Christian König <christian.koenig@xxxxxxx>
> 
> Thanks,
> Christian.
> 
> > > BTW: You could completely drop the "if (sched->timeout !=
> > > MAX_SCHEDULE_TIMEOUT)" here, because canceling is harmless as long
> > > as the structure is initialized.
> > 
> > Right.
> > 
> > Regards,
> > Lucas
> > 
> > > Christian.
> > > 
> > > > -		spin_lock(&sched->job_list_lock);
> > > >  
> > > > -		/* queue TDR for next job */
> > > > -		next = list_first_entry_or_null(&sched->ring_mirror_list,
> > > > -						struct drm_sched_job, node);
> > > > +	spin_lock(&sched->job_list_lock);
> > > > +	/* queue TDR for next job */
> > > > +	if (sched->timeout != MAX_SCHEDULE_TIMEOUT &&
> > > > +	    !list_is_last(&s_job->node, &sched->ring_mirror_list)) {
> > > > +		struct drm_sched_job *next = list_next_entry(s_job, node);
> > > >  
> > > > -		if (next)
> > > > +		if (!dma_fence_is_signaled(&next->s_fence->finished))
> > > >  			schedule_delayed_work(&next->work_tdr, sched->timeout);
> > > >  	}
> > > > +	/* remove job from ring_mirror_list */
> > > > +	list_del(&s_job->node);
> > > >  	spin_unlock(&sched->job_list_lock);
> > > > +
> > > >  	dma_fence_put(&s_job->s_fence->finished);
> > > >  	sched->ops->free_job(s_job);
> > > >  }
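For the record: reading "the extra "if" check" as that guard before
cancel_delayed_work_sync(), the whole function would then end up
looking roughly like this (my sketch of where v3 is heading, not the
final patch):

static void drm_sched_job_finish(struct work_struct *work)
{
	struct drm_sched_job *s_job = container_of(work, struct drm_sched_job,
						   finish_work);
	struct drm_gpu_scheduler *sched = s_job->sched;

	/*
	 * Unconditional cancel; harmless even when no timeout is
	 * configured, since the delayed work is always initialized.
	 */
	cancel_delayed_work_sync(&s_job->work_tdr);

	spin_lock(&sched->job_list_lock);
	/* queue TDR for the next job, unless it has already finished */
	if (sched->timeout != MAX_SCHEDULE_TIMEOUT &&
	    !list_is_last(&s_job->node, &sched->ring_mirror_list)) {
		struct drm_sched_job *next = list_next_entry(s_job, node);

		if (!dma_fence_is_signaled(&next->s_fence->finished))
			schedule_delayed_work(&next->work_tdr, sched->timeout);
	}
	/* remove job from ring_mirror_list; no list head reinit needed */
	list_del(&s_job->node);
	spin_unlock(&sched->job_list_lock);

	dma_fence_put(&s_job->s_fence->finished);
	sched->ops->free_job(s_job);
}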