On 2023-07-30 21:09, Matthew Brost wrote:
> On Thu, May 04, 2023 at 01:28:12AM -0400, Luben Tuikov wrote:
>> On 2023-04-03 20:22, Matthew Brost wrote:
>>> Add helper to set TDR timeout and restart the TDR with new timeout
>>> value. This will be used in XE, new Intel GPU driver, to trigger the TDR
>>> to cleanup drm_sched_entity that encounter errors.
>>>
>>> Signed-off-by: Matthew Brost <matthew.brost@xxxxxxxxx>
>>> ---
>>>  drivers/gpu/drm/scheduler/sched_main.c | 18 ++++++++++++++++++
>>>  include/drm/gpu_scheduler.h            |  1 +
>>>  2 files changed, 19 insertions(+)
>>>
>>> diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
>>> index 4eac02d212c1..d61880315d8d 100644
>>> --- a/drivers/gpu/drm/scheduler/sched_main.c
>>> +++ b/drivers/gpu/drm/scheduler/sched_main.c
>>> @@ -370,6 +370,24 @@ static void drm_sched_start_timeout(struct drm_gpu_scheduler *sched)
>>>  		queue_delayed_work(sched->timeout_wq, &sched->work_tdr, sched->timeout);
>>>  }
>>>
>>> +/**
>>> + * drm_sched_set_timeout - set timeout for reset worker
>>> + *
>>> + * @sched: scheduler instance to set and (re)-start the worker for
>>> + * @timeout: timeout period
>>> + *
>>> + * Set and (re)-start the timeout for the given scheduler.
>>> + */
>>> +void drm_sched_set_timeout(struct drm_gpu_scheduler *sched, long timeout)
>>> +{
>>> +	spin_lock(&sched->job_list_lock);
>>> +	sched->timeout = timeout;
>>> +	cancel_delayed_work(&sched->work_tdr);
>>
>> I see that the comment says "(re-)start"(sic). Is the rest of the logic
>> stable in that we don't need to use the _sync() version, and/or at least
>> inspect the return value of the one currently used?
>>
>
> Sorry for the delayed response; I am just reviving this series now and
> seeing this comment.
>
> We don't care if the TDR is currently executing (at least in Xe, which
> makes use of this function); that is totally fine, as we only care to
> change the future timeout values. I believe we actually call this from
> the TDR in Xe to set the timeout value to zero, so using a _sync()
> version would deadlock. We do this as a mechanism to kill the
> drm_gpu_scheduler and immediately time out all remaining jobs. We also
> call this in a few other places with a value of zero for the same reason
> (to kill the drm_gpu_scheduler).

(Catching up chronologically after vacation...)

Okay, that's fine, but this shows a need for an interface to simply kill
the DRM GPU scheduler. Perhaps we should provide that kind of
functionality, as opposed to gaming the scheduler by setting the timeout
to 0 to kill it. That might be simpler; see the sketch at the end of
this mail.

-- 
Regards,
Luben

>
> Matt
>
>> Regards,
>> Luben
>>
>>> +	drm_sched_start_timeout(sched);
>>> +	spin_unlock(&sched->job_list_lock);
>>> +}
>>> +EXPORT_SYMBOL(drm_sched_set_timeout);
>>> +
>>>  /**
>>>   * drm_sched_fault - immediately start timeout handler
>>>   *
>>> diff --git a/include/drm/gpu_scheduler.h b/include/drm/gpu_scheduler.h
>>> index 18172ae63ab7..6258e324bd7c 100644
>>> --- a/include/drm/gpu_scheduler.h
>>> +++ b/include/drm/gpu_scheduler.h
>>> @@ -593,6 +593,7 @@ void drm_sched_entity_modify_sched(struct drm_sched_entity *entity,
>>>  				   struct drm_gpu_scheduler **sched_list,
>>>  				   unsigned int num_sched_list);
>>>
>>> +void drm_sched_set_timeout(struct drm_gpu_scheduler *sched, long timeout);
>>>  void drm_sched_job_cleanup(struct drm_sched_job *job);
>>>  void drm_sched_wakeup(struct drm_gpu_scheduler *sched);
>>>  void drm_sched_add_msg(struct drm_gpu_scheduler *sched,
>>
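
P.S.: To make the above concrete, here is a rough, untested sketch of
what a dedicated kill interface could look like. drm_sched_tdr_kill() is
a name I am inventing here, not an existing function; the body simply
reuses the locking and restart pattern of drm_sched_set_timeout() from
this patch, with the zero timeout folded in so that callers state their
intent directly instead of passing a magic value. It would live in
sched_main.c next to drm_sched_start_timeout(), which is static there:

/**
 * drm_sched_tdr_kill - kill the scheduler by timing out all jobs
 *
 * @sched: scheduler instance to kill
 *
 * Hypothetical helper (not in the tree): force the timeout to zero and
 * fire the TDR immediately, so every remaining job times out at once.
 */
void drm_sched_tdr_kill(struct drm_gpu_scheduler *sched)
{
	spin_lock(&sched->job_list_lock);
	/* A zero timeout makes all remaining jobs time out immediately. */
	sched->timeout = 0;
	/* As in drm_sched_set_timeout(): replace any pending TDR timer.
	 * The non-_sync cancel is deliberate, so this stays safe to call
	 * from the TDR itself.
	 */
	cancel_delayed_work(&sched->work_tdr);
	drm_sched_start_timeout(sched);
	spin_unlock(&sched->job_list_lock);
}
EXPORT_SYMBOL(drm_sched_tdr_kill);

Xe's drm_sched_set_timeout(sched, 0) call sites would then read as
drm_sched_tdr_kill(sched), and drm_sched_set_timeout() could keep the
single meaning of adjusting future timeout values.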