On Wed, Sep 13, 2023 at 10:38:24PM -0400, Luben Tuikov wrote: > On 2023-09-11 22:16, Matthew Brost wrote: > > Add helper to set TDR timeout and restart the TDR with new timeout > > value. This will be used in XE, new Intel GPU driver, to trigger the TDR > > to cleanup drm_sched_entity that encounter errors. > > Do you just want to trigger the cleanup or do you really want to > modify the timeout and requeue TDR delayed work (to be triggered > later at a different time)? > > If the former, then might as well just add a function to queue that > right away. If the latter, then this would do, albeit with a few Let me change the function to queue it immediately as that is what is needed. Matt > notes as mentioned below. > > > > > Signed-off-by: Matthew Brost <matthew.brost@xxxxxxxxx> > > --- > > drivers/gpu/drm/scheduler/sched_main.c | 18 ++++++++++++++++++ > > include/drm/gpu_scheduler.h | 1 + > > 2 files changed, 19 insertions(+) > > > > diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c > > index 9dbfab7be2c6..689fb6686e01 100644 > > --- a/drivers/gpu/drm/scheduler/sched_main.c > > +++ b/drivers/gpu/drm/scheduler/sched_main.c > > @@ -426,6 +426,24 @@ static void drm_sched_start_timeout_unlocked(struct drm_gpu_scheduler *sched) > > spin_unlock(&sched->job_list_lock); > > } > > > > +/** > > + * drm_sched_set_timeout - set timeout for reset worker > > + * > > + * @sched: scheduler instance to set and (re)-start the worker for > > + * @timeout: timeout period > > + * > > + * Set and (re)-start the timeout for the given scheduler. > > + */ > > +void drm_sched_set_timeout(struct drm_gpu_scheduler *sched, long timeout) > > +{ > > Well, I'd perhaps call this "drm_sched_set_tdr_timeout()", or something > to that effect, as "drm_sched_set_timeout()" isn't clear that it is indeed > a cleanup timeout. However, it's totally up to you. :-) > > It appears that "long timeout" is the new job timeout, so it is possible > that a stuck job might be given old timeout + new timeout recovery time, > after this function is called. > > > + spin_lock(&sched->job_list_lock); > > + sched->timeout = timeout; > > + cancel_delayed_work(&sched->work_tdr); > > + drm_sched_start_timeout(sched); > > + spin_unlock(&sched->job_list_lock); > > +} > > +EXPORT_SYMBOL(drm_sched_set_timeout); > > Well, drm_sched_start_timeout() (which also has a name lacking description, perhaps > it should be "drm_sched_start_tdr_timeout()" or "...start_cleanup_timeout()"), anyway, > so that function compares to MAX_SCHEDULE_TIMEOUT and pending list not being empty > before it requeues delayed TDR work item. So, while a remote possibility, this new > function may have the unintended consequence of canceling TDR, and never restarting it. > I see it grabs the lock, however. Maybe wrap it in "if (sched->timeout != MAX_SCHEDULE_TIMEOUT)"? > How about using mod_delayed_work()? > -- > Regards, > Luben > > > + > > /** > > * drm_sched_fault - immediately start timeout handler > > * > > diff --git a/include/drm/gpu_scheduler.h b/include/drm/gpu_scheduler.h > > index 5d753ecb5d71..b7b818cd81b6 100644 > > --- a/include/drm/gpu_scheduler.h > > +++ b/include/drm/gpu_scheduler.h > > @@ -596,6 +596,7 @@ void drm_sched_entity_modify_sched(struct drm_sched_entity *entity, > > struct drm_gpu_scheduler **sched_list, > > unsigned int num_sched_list); > > > > +void drm_sched_set_timeout(struct drm_gpu_scheduler *sched, long timeout); > > void drm_sched_job_cleanup(struct drm_sched_job *job); > > void drm_sched_wakeup_if_can_queue(struct drm_gpu_scheduler *sched); > > void drm_sched_add_msg(struct drm_gpu_scheduler *sched, >