Re: [PATCH v3 10/13] drm/sched: Add helper to set TDR timeout

Luben Tuikov <luben.tuikov@xxxxxxx> · Wed, 13 Sep 2023 22:38:24 -0400

On 2023-09-11 22:16, Matthew Brost wrote:
> Add helper to set TDR timeout and restart the TDR with new timeout
> value. This will be used in XE, new Intel GPU driver, to trigger the TDR
> to cleanup drm_sched_entity that encounter errors.

Do you just want to trigger the cleanup or do you really want to
modify the timeout and requeue TDR delayed work (to be triggered
later at a different time)?

If the former, then might as well just add a function to queue that
right away. If the latter, then this would do, albeit with a few
notes as mentioned below.

> 
> Signed-off-by: Matthew Brost <matthew.brost@xxxxxxxxx>
> ---
>  drivers/gpu/drm/scheduler/sched_main.c | 18 ++++++++++++++++++
>  include/drm/gpu_scheduler.h            |  1 +
>  2 files changed, 19 insertions(+)
> 
> diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
> index 9dbfab7be2c6..689fb6686e01 100644
> --- a/drivers/gpu/drm/scheduler/sched_main.c
> +++ b/drivers/gpu/drm/scheduler/sched_main.c
> @@ -426,6 +426,24 @@ static void drm_sched_start_timeout_unlocked(struct drm_gpu_scheduler *sched)
>  	spin_unlock(&sched->job_list_lock);
>  }
>  
> +/**
> + * drm_sched_set_timeout - set timeout for reset worker
> + *
> + * @sched: scheduler instance to set and (re)-start the worker for
> + * @timeout: timeout period
> + *
> + * Set and (re)-start the timeout for the given scheduler.
> + */
> +void drm_sched_set_timeout(struct drm_gpu_scheduler *sched, long timeout)
> +{

Well, I'd perhaps call this "drm_sched_set_tdr_timeout()", or something
to that effect, as "drm_sched_set_timeout()" isn't clear that it is indeed
a cleanup timeout. However, it's totally up to you. :-)

It appears that "long timeout" is the new job timeout, so it is possible
that a stuck job might be given old timeout + new timeout recovery time,
after this function is called.

> +	spin_lock(&sched->job_list_lock);
> +	sched->timeout = timeout;
> +	cancel_delayed_work(&sched->work_tdr);
> +	drm_sched_start_timeout(sched);
> +	spin_unlock(&sched->job_list_lock);
> +}
> +EXPORT_SYMBOL(drm_sched_set_timeout);

Well, drm_sched_start_timeout() (which also has a name lacking description, perhaps
it should be "drm_sched_start_tdr_timeout()" or "...start_cleanup_timeout()"), anyway,
so that function compares to MAX_SCHEDULE_TIMEOUT and pending list not being empty
before it requeues delayed TDR work item. So, while a remote possibility, this new
function may have the unintended consequence of canceling TDR, and never restarting it.
I see it grabs the lock, however. Maybe wrap it in "if (sched->timeout != MAX_SCHEDULE_TIMEOUT)"?
How about using mod_delayed_work()?
-- 
Regards,
Luben

> +
>  /**
>   * drm_sched_fault - immediately start timeout handler
>   *
> diff --git a/include/drm/gpu_scheduler.h b/include/drm/gpu_scheduler.h
> index 5d753ecb5d71..b7b818cd81b6 100644
> --- a/include/drm/gpu_scheduler.h
> +++ b/include/drm/gpu_scheduler.h
> @@ -596,6 +596,7 @@ void drm_sched_entity_modify_sched(struct drm_sched_entity *entity,
>  				    struct drm_gpu_scheduler **sched_list,
>                                     unsigned int num_sched_list);
>  
> +void drm_sched_set_timeout(struct drm_gpu_scheduler *sched, long timeout);
>  void drm_sched_job_cleanup(struct drm_sched_job *job);
>  void drm_sched_wakeup_if_can_queue(struct drm_gpu_scheduler *sched);
>  void drm_sched_add_msg(struct drm_gpu_scheduler *sched,