On Wed, Sep 1, 2021 at 2:50 AM Christian König
<ckoenig.leichtzumerken@xxxxxxxxx> wrote:
>
> On 01.09.21 at 02:46, Monk Liu wrote:
> > issue:
> > in cleanup_job, cancel_delayed_work will cancel a TO timer
> > even when its corresponding job is still running.
> >
> > fix:
> > do not cancel the timer in cleanup_job; instead do the cancelling
> > only when the heading job is signaled, and if there is a "next" job
> > we start_timeout again.
> >
> > v2:
> > further clean up the logic, and do the TDR timer cancelling if the signaled job
> > is the last one in its scheduler.
> >
> > v3:
> > change the issue description
> > remove the cancel_delayed_work at the beginning of cleanup_job
> > restore the implementation of drm_sched_job_begin.
> >
> > v4:
> > remove the kthread_should_park() check in the cleanup_job routine;
> > we should clean up the signaled job asap
> >
> > TODO:
> > 1) introduce pause/resume of the scheduler in job_timeout to serialize the handling
> > of the scheduler and job_timeout.
> > 2) drop the bad job's del and insert in the scheduler due to the above serialization
> > (no race issue anymore with the serialization)
> >
> > Tested-by: jingwen <jingwen.chen@amd.com>
> > Signed-off-by: Monk Liu <Monk.Liu@xxxxxxx>
>
> Reviewed-by: Christian König <christian.koenig@xxxxxxx>

Are you planning to push this to drm-misc?

Alex

> > ---
> >  drivers/gpu/drm/scheduler/sched_main.c | 26 +++++++++-----------------
> >  1 file changed, 9 insertions(+), 17 deletions(-)
> >
> > diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
> > index a2a9536..3e0bbc7 100644
> > --- a/drivers/gpu/drm/scheduler/sched_main.c
> > +++ b/drivers/gpu/drm/scheduler/sched_main.c
> > @@ -676,15 +676,6 @@ drm_sched_get_cleanup_job(struct drm_gpu_scheduler *sched)
> >  {
> >  	struct drm_sched_job *job, *next;
> >
> > -	/*
> > -	 * Don't destroy jobs while the timeout worker is running OR thread
> > -	 * is being parked and hence assumed to not touch pending_list
> > -	 */
> > -	if ((sched->timeout != MAX_SCHEDULE_TIMEOUT &&
> > -	    !cancel_delayed_work(&sched->work_tdr)) ||
> > -	    kthread_should_park())
> > -		return NULL;
> > -
> >  	spin_lock(&sched->job_list_lock);
> >
> >  	job = list_first_entry_or_null(&sched->pending_list,
> > @@ -693,17 +684,21 @@ drm_sched_get_cleanup_job(struct drm_gpu_scheduler *sched)
> >  	if (job && dma_fence_is_signaled(&job->s_fence->finished)) {
> >  		/* remove job from pending_list */
> >  		list_del_init(&job->list);
> > +
> > +		/* cancel this job's TO timer */
> > +		cancel_delayed_work(&sched->work_tdr);
> >  		/* make the scheduled timestamp more accurate */
> >  		next = list_first_entry_or_null(&sched->pending_list,
> >  						typeof(*next), list);
> > -		if (next)
> > +
> > +		if (next) {
> >  			next->s_fence->scheduled.timestamp =
> >  				job->s_fence->finished.timestamp;
> > -
> > +			/* start TO timer for next job */
> > +			drm_sched_start_timeout(sched);
> > +		}
> >  	} else {
> >  		job = NULL;
> > -		/* queue timeout for next job */
> > -		drm_sched_start_timeout(sched);
> >  	}
> >
> >  	spin_unlock(&sched->job_list_lock);
> > @@ -791,11 +786,8 @@ static int drm_sched_main(void *param)
> >  					 (entity = drm_sched_select_entity(sched))) ||
> >  					 kthread_should_stop());
> >
> > -		if (cleanup_job) {
> > +		if (cleanup_job)
> >  			sched->ops->free_job(cleanup_job);
> > -			/* queue timeout for next job */
> > -			drm_sched_start_timeout(sched);
> > -		}
> >
> >  		if (!entity)
> >  			continue;
> >
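
For reference, a sketch of how drm_sched_get_cleanup_job() reads with the hunks above applied. Lines not visible in the quoted diff (the second argument of list_first_entry_or_null() and the function's tail returning the job) are filled in from context and are assumptions, not part of the patch as posted:

static struct drm_sched_job *
drm_sched_get_cleanup_job(struct drm_gpu_scheduler *sched)
{
	struct drm_sched_job *job, *next;

	spin_lock(&sched->job_list_lock);

	/* peek at the oldest job on the pending list */
	job = list_first_entry_or_null(&sched->pending_list,
				       struct drm_sched_job, list); /* assumed continuation */

	if (job && dma_fence_is_signaled(&job->s_fence->finished)) {
		/* remove job from pending_list */
		list_del_init(&job->list);

		/* cancel this job's TO timer */
		cancel_delayed_work(&sched->work_tdr);
		/* make the scheduled timestamp more accurate */
		next = list_first_entry_or_null(&sched->pending_list,
						typeof(*next), list);

		if (next) {
			next->s_fence->scheduled.timestamp =
				job->s_fence->finished.timestamp;
			/* start TO timer for next job */
			drm_sched_start_timeout(sched);
		}
	} else {
		job = NULL;
	}

	spin_unlock(&sched->job_list_lock);

	return job; /* assumed tail: caller frees it via sched->ops->free_job() */
}

The net effect is that the TO timer is only cancelled once the heading job has actually signaled, and it is restarted only if another job is still pending, which matches the description in the commit message above.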