On 2023-07-30 21:09, Matthew Brost wrote:
> On Thu, May 04, 2023 at 01:28:12AM -0400, Luben Tuikov wrote:
>> On 2023-04-03 20:22, Matthew Brost wrote:
>>> Add helper to set TDR timeout and restart the TDR with new timeout
>>> value. This will be used in XE, new Intel GPU driver, to trigger the TDR
>>> to cleanup drm_sched_entity that encounter errors.
>>>
>>> Signed-off-by: Matthew Brost <matthew.brost@xxxxxxxxx>
>>> ---
>>>  drivers/gpu/drm/scheduler/sched_main.c | 18 ++++++++++++++++++
>>>  include/drm/gpu_scheduler.h            |  1 +
>>>  2 files changed, 19 insertions(+)
>>>
>>> diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
>>> index 4eac02d212c1..d61880315d8d 100644
>>> --- a/drivers/gpu/drm/scheduler/sched_main.c
>>> +++ b/drivers/gpu/drm/scheduler/sched_main.c
>>> @@ -370,6 +370,24 @@ static void drm_sched_start_timeout(struct drm_gpu_scheduler *sched)
>>>  		queue_delayed_work(sched->timeout_wq, &sched->work_tdr, sched->timeout);
>>>  }
>>>
>>> +/**
>>> + * drm_sched_set_timeout - set timeout for reset worker
>>> + *
>>> + * @sched: scheduler instance to set and (re)-start the worker for
>>> + * @timeout: timeout period
>>> + *
>>> + * Set and (re)-start the timeout for the given scheduler.
>>> + */
>>> +void drm_sched_set_timeout(struct drm_gpu_scheduler *sched, long timeout)
>>> +{
>>> +	spin_lock(&sched->job_list_lock);
>>> +	sched->timeout = timeout;
>>> +	cancel_delayed_work(&sched->work_tdr);
>>
>> I see that the comment says "(re-)start"(sic). Is the rest of the logic
>> stable in that we don't need to use the _sync() version, and/or at least
>> inspect the return value of the one currently used?
>>
>
> Sorry for the delayed response; I am just reviving this series now and
> seeing this comment.
>
> We don't care if the TDR is currently executing (at least in Xe, which
> makes use of this function); that is totally fine, as we only care to
> change the future timeout values. I believe we actually call this from
> the TDR in Xe to set the timeout value to zero, so using a _sync()
> version would deadlock. We do this as a mechanism to kill the
> drm_gpu_scheduler and immediately time out all remaining jobs. We also
> call this in a few other places with a value of zero for the same reason
> (to kill the drm_gpu_scheduler).

(Catching up chronologically after vacation...)

Okay, that's fine, but this shows a need for an interface to simply kill
the DRM GPU scheduler. Perhaps we should provide that kind of
functionality, as opposed to gaming the scheduler by setting the timeout
to 0 to kill it. That might be simpler; see the sketch at the end of
this mail.

-- 
Regards,
Luben

>
> Matt
>
>> Regards,
>> Luben
>>
>>> +	drm_sched_start_timeout(sched);
>>> +	spin_unlock(&sched->job_list_lock);
>>> +}
>>> +EXPORT_SYMBOL(drm_sched_set_timeout);
>>> +
>>>  /**
>>>   * drm_sched_fault - immediately start timeout handler
>>>   *
>>> diff --git a/include/drm/gpu_scheduler.h b/include/drm/gpu_scheduler.h
>>> index 18172ae63ab7..6258e324bd7c 100644
>>> --- a/include/drm/gpu_scheduler.h
>>> +++ b/include/drm/gpu_scheduler.h
>>> @@ -593,6 +593,7 @@ void drm_sched_entity_modify_sched(struct drm_sched_entity *entity,
>>>  				   struct drm_gpu_scheduler **sched_list,
>>>  				   unsigned int num_sched_list);
>>>
>>> +void drm_sched_set_timeout(struct drm_gpu_scheduler *sched, long timeout);
>>>  void drm_sched_job_cleanup(struct drm_sched_job *job);
>>>  void drm_sched_wakeup(struct drm_gpu_scheduler *sched);
>>>  void drm_sched_add_msg(struct drm_gpu_scheduler *sched,
>>
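
P.S.: To make the above concrete, here is a rough, untested sketch of
what a dedicated kill interface could look like. drm_sched_tdr_kill() is
a name I am inventing here, not an existing function; the body simply
reuses the locking and restart pattern of drm_sched_set_timeout() from
this patch, with the zero timeout folded in so that callers state their
intent directly instead of passing a magic value. It would live in
sched_main.c next to drm_sched_start_timeout(), which is static there:

/**
 * drm_sched_tdr_kill - kill the scheduler by timing out all jobs
 *
 * @sched: scheduler instance to kill
 *
 * Hypothetical helper (not in the tree): force the timeout to zero and
 * fire the TDR immediately, so every remaining job times out at once.
 */
void drm_sched_tdr_kill(struct drm_gpu_scheduler *sched)
{
	spin_lock(&sched->job_list_lock);
	/* A zero timeout makes all remaining jobs time out immediately. */
	sched->timeout = 0;
	/* As in drm_sched_set_timeout(): replace any pending TDR timer.
	 * The non-_sync cancel is deliberate, so this stays safe to call
	 * from the TDR itself.
	 */
	cancel_delayed_work(&sched->work_tdr);
	drm_sched_start_timeout(sched);
	spin_unlock(&sched->job_list_lock);
}
EXPORT_SYMBOL(drm_sched_tdr_kill);

Xe's drm_sched_set_timeout(sched, 0) call sites would then read as
drm_sched_tdr_kill(sched), and drm_sched_set_timeout() could keep the
single meaning of adjusting future timeout values.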