On Thu, Aug 26, 2021 at 02:37:40PM +0200, Christian König wrote:
> Am 26.08.21 um 13:55 schrieb Liu, Monk:
> > [AMD Official Use Only]
> >
> > > > I'm not sure if the work_tdr is initialized when a maximum timeout is specified. Please double check.
> > Ok, will do
> >
> > > > BTW: Can we please drop the "tdr" naming from the scheduler? That is just a timeout functionality and not related to recovery in any way.
> > We even do not start hardware recovery in a lot of cases now (when wave kill is successfully).
> >
> > Umm, sounds reasonable, I can rename it to "to" with another patch
>
> Maybe more like job_timeout or timeout_work or something into that
> direction.

Yeah that's better. TO is even worse I think than TDR, which is at least
somewhat well-known from the windows side.

Also would be good to polish the commit message a bit, there's a few typos
and confusing wording.
-Daniel

>
> Christian.
>
> >
> > Thanks
> >
> > ------------------------------------------
> > Monk Liu | Cloud-GPU Core team
> > ------------------------------------------
> >
> > -----Original Message-----
> > From: Christian König <ckoenig.leichtzumerken@xxxxxxxxx>
> > Sent: Thursday, August 26, 2021 6:09 PM
> > To: Liu, Monk <Monk.Liu@xxxxxxx>; amd-gfx@xxxxxxxxxxxxxxxxxxxxx
> > Cc: dri-devel@xxxxxxxxxxxxxxxxxxxxx
> > Subject: Re: [PATCH] drm/sched: fix the bug of time out calculation(v3)
> >
> > Am 26.08.21 um 06:55 schrieb Monk Liu:
> > > issue:
> > > in cleanup_job the cancle_delayed_work will cancel a TO timer even the
> > > its corresponding job is still running.
> > Yeah, that makes a lot more sense.
> >
> > > fix:
> > > do not cancel the timer in cleanup_job, instead do the cancelling only
> > > when the heading job is signaled, and if there is a "next" job we
> > > start_timeout again.
> > >
> > > v2:
> > > further cleanup the logic, and do the TDR timer cancelling if the
> > > signaled job is the last one in its scheduler.
> > >
> > > v3:
> > > change the issue description
> > > remove the cancel_delayed_work in the begining of the cleanup_job
> > > recover the implement of drm_sched_job_begin.
> > >
> > > TODO:
> > > 1)introduce pause/resume scheduler in job_timeout to serial the
> > > handling of scheduler and job_timeout.
> > > 2)drop the bad job's del and insert in scheduler due to above
> > > serialization (no race issue anymore with the serialization)
> > >
> > > Signed-off-by: Monk Liu <Monk.Liu@xxxxxxx>
> > > ---
> > >   drivers/gpu/drm/scheduler/sched_main.c | 25 ++++++++++---------------
> > >   1 file changed, 10 insertions(+), 15 deletions(-)
> > >
> > > diff --git a/drivers/gpu/drm/scheduler/sched_main.c
> > > b/drivers/gpu/drm/scheduler/sched_main.c
> > > index a2a9536..ecf8140 100644
> > > --- a/drivers/gpu/drm/scheduler/sched_main.c
> > > +++ b/drivers/gpu/drm/scheduler/sched_main.c
> > > @@ -676,13 +676,7 @@ drm_sched_get_cleanup_job(struct drm_gpu_scheduler *sched)
> > >   {
> > >           struct drm_sched_job *job, *next;
> > > -         /*
> > > -          * Don't destroy jobs while the timeout worker is running OR thread
> > > -          * is being parked and hence assumed to not touch pending_list
> > > -          */
> > > -         if ((sched->timeout != MAX_SCHEDULE_TIMEOUT &&
> > > -             !cancel_delayed_work(&sched->work_tdr)) ||
> > > -             kthread_should_park())
> > > +         if (kthread_should_park())
> > >                   return NULL;
> > >           spin_lock(&sched->job_list_lock);
> > > @@ -693,17 +687,21 @@ drm_sched_get_cleanup_job(struct drm_gpu_scheduler *sched)
> > >           if (job && dma_fence_is_signaled(&job->s_fence->finished)) {
> > >                   /* remove job from pending_list */
> > >                   list_del_init(&job->list);
> > > +
> > > +                 /* cancel this job's TO timer */
> > > +                 cancel_delayed_work(&sched->work_tdr);
> > I'm not sure if the work_tdr is initialized when a maximum timeout is specified. Please double check.
> >
> > BTW: Can we please drop the "tdr" naming from the scheduler? That is just a timeout functionality and not related to recovery in any way.
> >
> > We even do not start hardware recovery in a lot of cases now (when wave kill is successfully).
> >
> > Regards,
> > Christian.
> >
> > >                   /* make the scheduled timestamp more accurate */
> > >                   next = list_first_entry_or_null(&sched->pending_list,
> > >                                                   typeof(*next), list);
> > > -                 if (next)
> > > +
> > > +                 if (next) {
> > >                           next->s_fence->scheduled.timestamp =
> > >                                   job->s_fence->finished.timestamp;
> > > -
> > > +                         /* start TO timer for next job */
> > > +                         drm_sched_start_timeout(sched);
> > > +                 }
> > >           } else {
> > >                   job = NULL;
> > > -                 /* queue timeout for next job */
> > > -                 drm_sched_start_timeout(sched);
> > >           }
> > >           spin_unlock(&sched->job_list_lock);
> > > @@ -791,11 +789,8 @@ static int drm_sched_main(void *param)
> > >                            (entity = drm_sched_select_entity(sched))) ||
> > >                            kthread_should_stop());
> > > -         if (cleanup_job) {
> > > +         if (cleanup_job)
> > >                   sched->ops->free_job(cleanup_job);
> > > -                 /* queue timeout for next job */
> > > -                 drm_sched_start_timeout(sched);
> > > -         }
> > >           if (!entity)
> > >                   continue;
>
-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch
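
For context on the work_tdr question raised twice above: in the upstream scheduler of this period, drm_sched_start_timeout() only arms the delayed work when a finite timeout is configured, while the work item itself is set up in drm_sched_init() regardless of the timeout value. Below is a rough sketch of the arming helper, based on the upstream code around this time and not necessarily the exact tree the patch is against; whether it queues on a dedicated sched->timeout_wq or via plain schedule_delayed_work() depends on the version.

/*
 * Sketch of the upstream helper, for reference only -- details may
 * differ in the tree under review.
 */
static void drm_sched_start_timeout(struct drm_gpu_scheduler *sched)
{
        /*
         * The timer is only ever queued for a finite timeout and a
         * non-empty pending list; with MAX_SCHEDULE_TIMEOUT it is
         * never armed at all.
         */
        if (sched->timeout != MAX_SCHEDULE_TIMEOUT &&
            !list_empty(&sched->pending_list))
                queue_delayed_work(sched->timeout_wq, &sched->work_tdr,
                                   sched->timeout);
}

Assuming drm_sched_init() still does an unconditional INIT_DELAYED_WORK(&sched->work_tdr, ...), the unconditional cancel_delayed_work() added by the patch is a harmless no-op in the MAX_SCHEDULE_TIMEOUT case: cancelling an initialized but never-queued delayed work simply returns false.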