On Thu, Jul 25, 2024 at 09:42:08AM +0200, Christian König wrote:
> Am 25.07.24 um 01:44 schrieb Matthew Brost:
> > Only start the timeout in drm_sched_job_begin on the first job being
> > added to the pending list, as if the pending list is non-empty the TDR
> > has already been started. It is problematic to restart the TDR as it
> > will extend the TDR period for an already running job, potentially
> > leading to dma-fence signaling for a very long period of time with a
> > continuous stream of jobs.
>
> Mhm, that should be unnecessary. drm_sched_start_timeout() should only start
> the timeout, but never re-start it.
>

That function only checks that the pending list is not empty, so it does
indeed restart the timeout here. That is the correct behavior for some of
the callers, e.g. drm_sched_tdr_queue_imm and drm_sched_get_finished_job,
so IMO it is best to fix this here.

Also FWIW, on Xe I wrote a test which submitted a never-ending spinner and
then submitted a job every second on the same queue in a loop, and observed
that the spinner did not get canceled for a long time. After this patch, the
spinner correctly timed out after 5 seconds (our default TDR period).

Matt

> Could be that this isn't working properly.
>
> Regards,
> Christian.
>
> >
> > Cc: Christian König <christian.koenig@xxxxxxx>
> > Signed-off-by: Matthew Brost <matthew.brost@xxxxxxxxx>
> > ---
> >   drivers/gpu/drm/scheduler/sched_main.c | 3 ++-
> >   1 file changed, 2 insertions(+), 1 deletion(-)
> >
> > diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
> > index 7e90c9f95611..feeeb9dbeb86 100644
> > --- a/drivers/gpu/drm/scheduler/sched_main.c
> > +++ b/drivers/gpu/drm/scheduler/sched_main.c
> > @@ -540,7 +540,8 @@ static void drm_sched_job_begin(struct drm_sched_job *s_job)
> >   	spin_lock(&sched->job_list_lock);
> >   	list_add_tail(&s_job->list, &sched->pending_list);
> > -	drm_sched_start_timeout(sched);
> > +	if (list_is_singular(&sched->pending_list))
> > +		drm_sched_start_timeout(sched);
> >   	spin_unlock(&sched->job_list_lock);
> >   }
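
For illustration only, below is a minimal userspace sketch (not kernel code;
the list helpers are local re-implementations that merely mirror the
semantics of the kernel's list_add_tail() and list_is_singular()) showing why
the check added by the patch fires exactly once, when the first job lands on
an empty pending list, and stays false for every subsequent job while the
timeout is already running:

/*
 * Userspace sketch: demonstrate that list_is_singular() right after
 * list_add_tail() is true only on the empty -> non-empty transition.
 */
#include <stdbool.h>
#include <stdio.h>

struct list_head {
	struct list_head *next, *prev;
};

static void init_list_head(struct list_head *head)
{
	head->next = head;
	head->prev = head;
}

static void list_add_tail(struct list_head *new, struct list_head *head)
{
	new->prev = head->prev;
	new->next = head;
	head->prev->next = new;
	head->prev = new;
}

static bool list_is_singular(const struct list_head *head)
{
	/* Non-empty and first entry == last entry, i.e. exactly one job. */
	return head->next != head && head->next == head->prev;
}

int main(void)
{
	struct list_head pending_list, job1, job2;

	init_list_head(&pending_list);

	list_add_tail(&job1, &pending_list);
	/* First job: list went empty -> non-empty, arm the timeout. */
	printf("after job1: singular=%d -> start timeout\n",
	       list_is_singular(&pending_list));

	list_add_tail(&job2, &pending_list);
	/* Later jobs: timeout already armed, leave it alone. */
	printf("after job2: singular=%d -> do not restart timeout\n",
	       list_is_singular(&pending_list));

	return 0;
}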