Re: [PATCH v3 10/13] drm/sched: Add helper to set TDR timeout

Matthew Brost <matthew.brost@xxxxxxxxx> · Thu, 14 Sep 2023 17:36:27 +0000

On Wed, Sep 13, 2023 at 10:38:24PM -0400, Luben Tuikov wrote:
> On 2023-09-11 22:16, Matthew Brost wrote:
> > Add helper to set TDR timeout and restart the TDR with new timeout
> > value. This will be used in XE, new Intel GPU driver, to trigger the TDR
> > to cleanup drm_sched_entity that encounter errors.
> 
> Do you just want to trigger the cleanup or do you really want to
> modify the timeout and requeue TDR delayed work (to be triggered
> later at a different time)?
> 
> If the former, then might as well just add a function to queue that
> right away. If the latter, then this would do, albeit with a few

Let me change the function to queue it immediately as that is what is
needed.

Matt

> notes as mentioned below.
> 
> > 
> > Signed-off-by: Matthew Brost <matthew.brost@xxxxxxxxx>
> > ---
> >  drivers/gpu/drm/scheduler/sched_main.c | 18 ++++++++++++++++++
> >  include/drm/gpu_scheduler.h            |  1 +
> >  2 files changed, 19 insertions(+)
> > 
> > diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
> > index 9dbfab7be2c6..689fb6686e01 100644
> > --- a/drivers/gpu/drm/scheduler/sched_main.c
> > +++ b/drivers/gpu/drm/scheduler/sched_main.c
> > @@ -426,6 +426,24 @@ static void drm_sched_start_timeout_unlocked(struct drm_gpu_scheduler *sched)
> >  	spin_unlock(&sched->job_list_lock);
> >  }
> >  
> > +/**
> > + * drm_sched_set_timeout - set timeout for reset worker
> > + *
> > + * @sched: scheduler instance to set and (re)-start the worker for
> > + * @timeout: timeout period
> > + *
> > + * Set and (re)-start the timeout for the given scheduler.
> > + */
> > +void drm_sched_set_timeout(struct drm_gpu_scheduler *sched, long timeout)
> > +{
> 
> Well, I'd perhaps call this "drm_sched_set_tdr_timeout()", or something
> to that effect, as "drm_sched_set_timeout()" isn't clear that it is indeed
> a cleanup timeout. However, it's totally up to you. :-)
> 
> It appears that "long timeout" is the new job timeout, so it is possible
> that a stuck job might be given old timeout + new timeout recovery time,
> after this function is called.
> 
> > +	spin_lock(&sched->job_list_lock);
> > +	sched->timeout = timeout;
> > +	cancel_delayed_work(&sched->work_tdr);
> > +	drm_sched_start_timeout(sched);
> > +	spin_unlock(&sched->job_list_lock);
> > +}
> > +EXPORT_SYMBOL(drm_sched_set_timeout);
> 
> Well, drm_sched_start_timeout() (which also has a name lacking description, perhaps
> it should be "drm_sched_start_tdr_timeout()" or "...start_cleanup_timeout()"), anyway,
> so that function compares to MAX_SCHEDULE_TIMEOUT and pending list not being empty
> before it requeues delayed TDR work item. So, while a remote possibility, this new
> function may have the unintended consequence of canceling TDR, and never restarting it.
> I see it grabs the lock, however. Maybe wrap it in "if (sched->timeout != MAX_SCHEDULE_TIMEOUT)"?
> How about using mod_delayed_work()?
> -- 
> Regards,
> Luben
> 
> > +
> >  /**
> >   * drm_sched_fault - immediately start timeout handler
> >   *
> > diff --git a/include/drm/gpu_scheduler.h b/include/drm/gpu_scheduler.h
> > index 5d753ecb5d71..b7b818cd81b6 100644
> > --- a/include/drm/gpu_scheduler.h
> > +++ b/include/drm/gpu_scheduler.h
> > @@ -596,6 +596,7 @@ void drm_sched_entity_modify_sched(struct drm_sched_entity *entity,
> >  				    struct drm_gpu_scheduler **sched_list,
> >                                     unsigned int num_sched_list);
> >  
> > +void drm_sched_set_timeout(struct drm_gpu_scheduler *sched, long timeout);
> >  void drm_sched_job_cleanup(struct drm_sched_job *job);
> >  void drm_sched_wakeup_if_can_queue(struct drm_gpu_scheduler *sched);
> >  void drm_sched_add_msg(struct drm_gpu_scheduler *sched,
>