On Wed, 19 Mar 2025 19:51:47 +0000
Adrian Larumbe <adrian.larumbe@xxxxxxxxxxxxx> wrote:

> On 10.03.2025 13:30, Ashley Smith wrote:
> > The timeout logic provided by drm_sched leads to races when we try
> > to suspend it while the drm_sched workqueue queues more jobs. Let's
> > overhaul the timeout handling in panthor to have our own delayed work
> > that's resumed/suspended when a group is resumed/suspended. When an
> > actual timeout occurs, we call drm_sched_fault() to report it
> > through drm_sched, still. But otherwise, the drm_sched timeout is
> > disabled (set to MAX_SCHEDULE_TIMEOUT), which leaves us in control of
> > how we protect modifications on the timer.
> >
> > One issue seems to be when we call drm_sched_suspend_timeout() from
> > both queue_run_job() and tick_work() which could lead to races due to
> > drm_sched_suspend_timeout() not having a lock. Another issue seems to
> > be in queue_run_job() if the group is not scheduled, we suspend the
> > timeout again which undoes what drm_sched_job_begin() did when calling
> > drm_sched_start_timeout(). So the timeout does not reset when a job
> > is finished.
> >
> > Co-developed-by: Boris Brezillon <boris.brezillon@xxxxxxxxxxxxx>
> > Signed-off-by: Boris Brezillon <boris.brezillon@xxxxxxxxxxxxx>
> > Tested-by: Daniel Stone <daniels@xxxxxxxxxxxxx>
> > Fixes: de8548813824 ("drm/panthor: Add the scheduler logical block")
> > Signed-off-by: Ashley Smith <ashley.smith@xxxxxxxxxxxxx>
>
> Reviewed-by: Adrián Larumbe <adrian.larumbe@xxxxxxxxxxxxx>
>
> > ---
> >  drivers/gpu/drm/panthor/panthor_sched.c | 233 +++++++++++++++++-------
> >  1 file changed, 167 insertions(+), 66 deletions(-)
> >
> > diff --git a/drivers/gpu/drm/panthor/panthor_sched.c b/drivers/gpu/drm/panthor/panthor_sched.c
> > index 4d31d1967716..5f02d2ec28f9 100644
> > --- a/drivers/gpu/drm/panthor/panthor_sched.c
> > +++ b/drivers/gpu/drm/panthor/panthor_sched.c
> > @@ -360,17 +360,20 @@ struct panthor_queue {
> >  	/** @entity: DRM scheduling entity used for this queue. */
> >  	struct drm_sched_entity entity;
> >
> > -	/**
> > -	 * @remaining_time: Time remaining before the job timeout expires.
> > -	 *
> > -	 * The job timeout is suspended when the queue is not scheduled by the
> > -	 * FW. Every time we suspend the timer, we need to save the remaining
> > -	 * time so we can restore it later on.
> > -	 */
> > -	unsigned long remaining_time;
> > +	/** @timeout: Queue timeout related fields. */
> > +	struct {
> > +		/** @timeout.work: Work executed when a queue timeout occurs. */
> > +		struct delayed_work work;
>
> Nit: Maybe for the sake of sticking to the convention of naming already
> existing delayed_work structs in a way that reflects their goal, call
> this one 'timeout_work'.

It's already under the timeout struct, and naming it timeout_work would be
redundant IMHO (timeout.timeout_work vs timeout.work).
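
To make the naming point concrete, here is a minimal userspace sketch of how
the nested field reads at its call sites, together with the suspend/resume
bookkeeping the commit message describes. This is not the panthor code:
struct delayed_work_model, struct queue_model, the 'remaining' field and all
four helpers are hypothetical names made up for illustration, time is passed
in as an explicit tick count instead of jiffies, and no locking is modeled
(the real driver would need to serialize these updates).

```c
#include <stdbool.h>

/* Hypothetical userspace stand-in for the kernel's struct delayed_work. */
struct delayed_work_model {
	bool pending;          /* true while the timer is armed */
	unsigned long expires; /* absolute deadline, in ticks */
};

/* Hypothetical queue model; only 'timeout.work' appears in the quoted diff. */
struct queue_model {
	struct {
		struct delayed_work_model work;
		unsigned long remaining; /* assumed field: ticks left while suspended */
	} timeout;
};

/* Start with the timer disarmed and a full period left to run. */
static void queue_timeout_init(struct queue_model *q, unsigned long period)
{
	q->timeout.work.pending = false;
	q->timeout.work.expires = 0;
	q->timeout.remaining = period;
}

/* Suspend: disarm the timer and remember how much time was left. */
static void queue_timeout_suspend(struct queue_model *q, unsigned long now)
{
	if (!q->timeout.work.pending)
		return;

	q->timeout.remaining = q->timeout.work.expires > now ?
			       q->timeout.work.expires - now : 0;
	q->timeout.work.pending = false;
}

/* Resume: re-arm the timer with whatever time was left at suspend. */
static void queue_timeout_resume(struct queue_model *q, unsigned long now)
{
	if (q->timeout.work.pending)
		return;

	q->timeout.work.expires = now + q->timeout.remaining;
	q->timeout.work.pending = true;
}

/* A suspended timer never reports expiry, however much time has passed. */
static bool queue_timeout_expired(const struct queue_model *q, unsigned long now)
{
	return q->timeout.work.pending && now >= q->timeout.work.expires;
}
```

Every access already reads as q->timeout.work, so the enclosing struct
carries the "timeout" part of the name; q->timeout.timeout_work would only
repeat it.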