Re: [PATCH v2] drm/panthor: Make the timeout per-queue instead of per-job

Boris Brezillon <boris.brezillon@xxxxxxxxxxxxx> · Thu, 20 Mar 2025 08:53:03 +0100

On Wed, 19 Mar 2025 19:51:47 +0000
Adrian Larumbe <adrian.larumbe@xxxxxxxxxxxxx> wrote:

> On 10.03.2025 13:30, Ashley Smith wrote:
> > The timeout logic provided by drm_sched leads to races when we try
> > to suspend it while the drm_sched workqueue queues more jobs. Let's
> > overhaul the timeout handling in panthor to have our own delayed work
> > that's resumed/suspended when a group is resumed/suspended. When an
> > actual timeout occurs, we call drm_sched_fault() to report it
> > through drm_sched, still. But otherwise, the drm_sched timeout is
> > disabled (set to MAX_SCHEDULE_TIMEOUT), which leaves us in control of
> > how we protect modifications on the timer.
> >
> > One issue seems to be that drm_sched_suspend_timeout() is called from
> > both queue_run_job() and tick_work(), which could lead to races because
> > drm_sched_suspend_timeout() is not protected by a lock. Another issue
> > seems to be in queue_run_job(): if the group is not scheduled, we
> > suspend the timeout again, which undoes what drm_sched_job_begin() did
> > when calling drm_sched_start_timeout(). As a result, the timeout does
> > not reset when a job is finished.
> >
> > Co-developed-by: Boris Brezillon <boris.brezillon@xxxxxxxxxxxxx>
> > Signed-off-by: Boris Brezillon <boris.brezillon@xxxxxxxxxxxxx>
> > Tested-by: Daniel Stone <daniels@xxxxxxxxxxxxx>
> > Fixes: de8548813824 ("drm/panthor: Add the scheduler logical block")
> > Signed-off-by: Ashley Smith <ashley.smith@xxxxxxxxxxxxx>  
> 
> Reviewed-by: Adrián Larumbe <adrian.larumbe@xxxxxxxxxxxxx>
> 
> > ---
> >  drivers/gpu/drm/panthor/panthor_sched.c | 233 +++++++++++++++++-------
> >  1 file changed, 167 insertions(+), 66 deletions(-)
> >
> > diff --git a/drivers/gpu/drm/panthor/panthor_sched.c b/drivers/gpu/drm/panthor/panthor_sched.c
> > index 4d31d1967716..5f02d2ec28f9 100644
> > --- a/drivers/gpu/drm/panthor/panthor_sched.c
> > +++ b/drivers/gpu/drm/panthor/panthor_sched.c
> > @@ -360,17 +360,20 @@ struct panthor_queue {
> >  	/** @entity: DRM scheduling entity used for this queue. */
> >  	struct drm_sched_entity entity;
> >
> > -	/**
> > -	 * @remaining_time: Time remaining before the job timeout expires.
> > -	 *
> > -	 * The job timeout is suspended when the queue is not scheduled by the
> > -	 * FW. Every time we suspend the timer, we need to save the remaining
> > -	 * time so we can restore it later on.
> > -	 */
> > -	unsigned long remaining_time;
> > +	/** @timeout: Queue timeout related fields. */
> > +	struct {
> > +		/** @timeout.work: Work executed when a queue timeout occurs. */
> > +		struct delayed_work work;  
> 
> Nit: To stick with the convention of naming the existing delayed_work
> structs after their purpose, maybe call this one 'timeout_work'.

It's already under the timeout struct, and naming it timeout_work would
be redundant IMHO (timeout.timeout_work vs timeout.work).
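
For illustration only, here is a minimal userspace sketch of the
bookkeeping described in the commit message: a per-queue timer that saves
its remaining time on suspend and restores it on resume, with the field
grouped under a timeout struct so it reads as queue->timeout.work. This is
not the patch's code. The model_* names, the 'remaining' field, the tick
units and the lack of locking are all assumptions made for the example.
Real driver code would use struct delayed_work and would need to serialize
these calls.

```c
#include <assert.h>
#include <stdbool.h>

/* Stand-in for the kernel's struct delayed_work (userspace model only). */
struct model_delayed_work {
	bool armed;      /* true while the timer is counting down */
	long expires_at; /* absolute expiry, in arbitrary ticks */
};

struct model_queue {
	/* Queue timeout related fields, grouped under one struct. */
	struct {
		struct model_delayed_work work; /* reads as queue->timeout.work */
		long remaining; /* hypothetical: ticks left while suspended */
	} timeout;
};

static void model_queue_init(struct model_queue *q, long period)
{
	assert(period > 0);
	q->timeout.work.armed = false;
	q->timeout.work.expires_at = 0;
	q->timeout.remaining = period;
}

/* Stop the timer and remember how much time was left. */
static void model_timeout_suspend(struct model_queue *q, long now)
{
	long left;

	/* Suspending twice must not clobber the saved remaining time. */
	if (!q->timeout.work.armed)
		return;

	left = q->timeout.work.expires_at - now;
	q->timeout.remaining = left > 0 ? left : 0;
	q->timeout.work.armed = false;
}

/* Re-arm the timer with whatever time was left at suspend. */
static void model_timeout_resume(struct model_queue *q, long now)
{
	if (q->timeout.work.armed)
		return;

	q->timeout.work.expires_at = now + q->timeout.remaining;
	q->timeout.work.armed = true;
}

/* A suspended timer never expires; an armed one expires at its deadline. */
static bool model_timeout_expired(const struct model_queue *q, long now)
{
	return q->timeout.work.armed && now >= q->timeout.work.expires_at;
}
```

In this model a second suspend is a no-op, so the saved remaining time
survives repeated suspend calls from different paths.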