Hi Christian,

I am a bit confused by this patch. It would help if you could explain what these three functions are supposed to do. This is my current understanding:

1. drm_sched_job_timedout: called when a job has been executing for more than "timeout" seconds. It needs to identify the bad job and call the driver's timedout_job callback, which notifies the driver of that job so the GPU can be recovered to a working state.

2. drm_sched_hw_job_reset: stop the scheduler if it contains the bad job. At the moment this function first removes all the callbacks and only afterwards walks the ring_mirror_list to find the bad job. I think it should do this in the opposite order, or my understanding of the function is not fully correct.

3. drm_sched_job_recovery: recover jobs after a reset, i.e. resubmit the jobs on the hardware queue while skipping the guilty job.

Some other questions and suggestions:

1. We could avoid the race condition altogether by shifting the canceling and rescheduling of the timedout work item into drm_sched_process_job() (a rough sketch of what I mean is at the end of this mail).

2. How does removing and re-adding the callbacks help with the race condition? Moreover, the hardware might complete some jobs while we are in this function, leading to more races.

3. In what order are the above three functions supposed to be called during a recovery (my current guess is in the second sketch at the end of this mail)? I think the answer to this question would clear up a lot of my doubts.

Regards,
Nayan

On Mon, Oct 8, 2018 at 8:36 PM Christian König <ckoenig.leichtzumerken@xxxxxxxxx> wrote:
>
> We need to make sure that we don't race between job completion and
> timeout.
>
> Signed-off-by: Christian König <christian.koenig@xxxxxxx>
> ---
>  drivers/gpu/drm/scheduler/sched_main.c | 28 +++++++++++++++++++++++++++-
>  1 file changed, 27 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
> index bd7d11c47202..ad3c57c9fd21 100644
> --- a/drivers/gpu/drm/scheduler/sched_main.c
> +++ b/drivers/gpu/drm/scheduler/sched_main.c
> @@ -248,14 +248,40 @@ static void drm_sched_job_begin(struct drm_sched_job *s_job)
>  static void drm_sched_job_timedout(struct work_struct *work)
>  {
>          struct drm_gpu_scheduler *sched;
> +        struct drm_sched_fence *fence;
>          struct drm_sched_job *job;
> +        int r;
>
>          sched = container_of(work, struct drm_gpu_scheduler, work_tdr.work);
> +
> +        spin_lock(&sched->job_list_lock);
> +        list_for_each_entry_reverse(job, &sched->ring_mirror_list, node) {
> +                fence = job->s_fence;
> +                if (!dma_fence_remove_callback(fence->parent, &fence->cb))
> +                        goto already_signaled;
> +        }
> +
>          job = list_first_entry_or_null(&sched->ring_mirror_list,
>                                         struct drm_sched_job, node);
> +        spin_unlock(&sched->job_list_lock);
>
>          if (job)
> -                job->sched->ops->timedout_job(job);
> +                sched->ops->timedout_job(job);
> +
> +        spin_lock(&sched->job_list_lock);
> +        list_for_each_entry(job, &sched->ring_mirror_list, node) {
> +                fence = job->s_fence;
> +                if (!fence->parent || !list_empty(&fence->cb.node))
> +                        continue;
> +
> +                r = dma_fence_add_callback(fence->parent, &fence->cb,
> +                                           drm_sched_process_job);
> +                if (r)
> +already_signaled:
> +                        drm_sched_process_job(fence->parent, &fence->cb);
> +
> +        }
> +        spin_unlock(&sched->job_list_lock);
>  }
>
>  /**
> --
> 2.14.1
>
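P.S.: To make suggestion 1 above a bit more concrete, here is a rough, untested sketch of what I had in mind for drm_sched_process_job() in sched_main.c. The existing completion handling is abbreviated to two lines, locking against drm_sched_job_finish() is ignored, and the list_empty() check pretends the finished job has already left the ring_mirror_list (which really happens later in drm_sched_job_finish()); I am only reusing fields that already appear in your patch (work_tdr, timeout, ring_mirror_list):

static void drm_sched_process_job(struct dma_fence *f, struct dma_fence_cb *cb)
{
        struct drm_sched_fence *s_fence =
                container_of(cb, struct drm_sched_fence, cb);
        struct drm_gpu_scheduler *sched = s_fence->sched;

        /* ... existing completion handling (hw_rq_count accounting,
         * signaling the finished fence, waking the worker) stays as it
         * is today ...
         */
        drm_sched_fence_finished(s_fence);
        wake_up_interruptible(&sched->wake_up_worker);

        /* A job just completed, so the current timeout no longer applies
         * to it.  Cancel the pending TDR work and, if there are still
         * jobs in flight, restart the full timeout for the next one.
         */
        cancel_delayed_work(&sched->work_tdr);
        if (!list_empty(&sched->ring_mirror_list))
                schedule_delayed_work(&sched->work_tdr, sched->timeout);
}

The point is only that the completion path already knows exactly when a job finished, so it seems like the natural place to cancel or re-arm the TDR work, instead of detaching and re-attaching the fence callbacks from the timeout handler.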
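And to restate question 3 as code: is something like the following the intended ordering? This is a purely hypothetical driver-side recovery path, my_driver_gpu_recover() and my_driver_reset_hw() are made-up names, and details such as parking the scheduler thread are left out on purpose:

#include <drm/gpu_scheduler.h>

/* Made-up driver hook, just so step 2 has something to call. */
static void my_driver_reset_hw(void);

static void my_driver_gpu_recover(struct drm_gpu_scheduler *sched,
                                  struct drm_sched_job *bad_job)
{
        /* 1. Stop the scheduler from reacting to hardware fences (the
         *    callbacks are removed) and let it mark the guilty context,
         *    so completions cannot race with the reset that follows.
         */
        drm_sched_hw_job_reset(sched, bad_job);

        /* 2. Reset the hardware itself (driver specific). */
        my_driver_reset_hw();

        /* 3. Re-submit everything that was still on the ring, skipping
         *    the guilty job, and let the scheduler continue.
         */
        drm_sched_job_recovery(sched);
}

If that is roughly the intended flow, it would already answer most of my questions above.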