Re: [PATCH 2/2] drm/sched: serialize job_timeout and scheduler

Rob Clark <robdclark@xxxxxxxxx> · Tue, 9 Nov 2021 08:17:01 -0800

On Tue, Nov 9, 2021 at 1:07 AM Daniel Vetter <daniel@xxxxxxxx> wrote:
>
> On Mon, Nov 08, 2021 at 03:39:17PM -0800, Rob Clark wrote:
> > I stumbled across this thread when I ran into the same issue, while
> > working out how to move drm/msm to use scheduler's retire +
> > timeout/recovery (and get rid of our own mirror list of in-flight
> > jobs).  We already have hw error detection enabled, and it can signal
> > quite fast, so assuming the first job on the list is the guilty job
> > just won't work.
> >
> > But I was considering a slightly different approach to fixing this,
> > instead just handling it all in drm_sched_main() and getting rid of
> > the complicated kthread parking gymnastics.  Ie. something along the
> > lines of:
>
> So handling timeouts in the main sched thread won't work as soon as you
> have multiple engines and a reset that impacts across engines:
>
> - Nothing is simplified since you still need to stop the other scheduler
>   threads.
>
> - You get deadlocks if 2 schedulers time out at the same time, and both
>   want to stop the other one.
>
> Hence workqueue. Now the rule for the wq is that you can only have one per
> reset domain, so
> - single engine you just take the one drm/sched provides
> - if reset affects all your engines in the chip, then you allocate one in
>   the drm_device and pass that to all
> - if you have a complex of gpus all interconnected (e.g. xgmi hive for
>   amd), then it's one wq for the entire hive
>
> _All_ reset related things must be run on that workqueue or things break,
> which means if you get a hw fault, that also needs to be run there. I guess
> we should either patch drm/sched to check you call that function from the
> right workqueue, or just handle it internally.

Hmm, ok.. I guess it would be useful to better document the reasoning
for the current design; that would have steered me more towards the
approach taken in this patch.
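
For the archives, here is a rough userspace model of the rule as I
understand it from the explanation above: one serialized queue per reset
domain, every reset trigger (sw timeout or hw fault) goes through it, and
the reset code checks that it is being run from that queue.  This is only
a sketch and not kernel code; reset_domain, reset_work and the helper
names are made up for illustration, and in the kernel the queue would
presumably be an ordered workqueue handed to the scheduler as
sched->timeout_wq.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

#define MAX_PENDING 8

/* One queued reset request, from a sw timeout or a hw fault. */
struct reset_work {
	int engine;          /* engine that reported the problem */
	const char *reason;  /* free-form description, for logging only */
};

/*
 * Hypothetical stand-in for the single workqueue shared by every engine
 * that is affected by the same reset.
 */
struct reset_domain {
	struct reset_work pending[MAX_PENDING];
	size_t head, tail;   /* FIFO indices, tail - head == queued items */
	bool in_worker;      /* true only while a queued item is executing */
	int resets_done;
};

/* Queue a reset request; returns false if the queue is full. */
static bool reset_domain_queue(struct reset_domain *d, int engine,
			       const char *reason)
{
	if (d->tail - d->head == MAX_PENDING)
		return false;

	d->pending[d->tail % MAX_PENDING] = (struct reset_work){ engine, reason };
	d->tail++;
	return true;
}

/* The actual recovery; must only ever run from the domain's queue. */
static void do_reset(struct reset_domain *d, const struct reset_work *w)
{
	/* Models the "called from the right workqueue" check. */
	assert(d->in_worker);

	printf("reset for engine %d (%s)\n", w->engine, w->reason);
	d->resets_done++;
}

/* Run queued requests strictly one at a time, in submission order. */
static int reset_domain_drain(struct reset_domain *d)
{
	int handled = 0;

	while (d->head != d->tail) {
		struct reset_work w = d->pending[d->head % MAX_PENDING];

		d->head++;
		d->in_worker = true;
		do_reset(d, &w);
		d->in_worker = false;
		handled++;
	}
	return handled;
}
```

Two engines that fault at the same time both call reset_domain_queue()
on the same domain, and reset_domain_drain() then handles them back to
back, so neither ever has to stop the other from inside its own reset
path.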

BR,
-R

> -Daniel
>
> >
> > ---------------------
> > diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
> > index 67382621b429..4d6ce775c316 100644
> > --- a/drivers/gpu/drm/scheduler/sched_main.c
> > +++ b/drivers/gpu/drm/scheduler/sched_main.c
> > @@ -764,6 +764,45 @@ static bool drm_sched_blocked(struct drm_gpu_scheduler *sched)
> >         return false;
> >  }
> >
> > +static bool handle_timeout(struct drm_gpu_scheduler *sched)
> > +{
> > +       struct drm_sched_job *bad;
> > +
> > +       if (!sched->has_timeout)
> > +               return false;
> > +
> > +       sched->has_timeout = false;
> > +
> > +       spin_lock(&sched->job_list_lock);
> > +       bad = list_first_entry_or_null(&sched->pending_list,
> > +                                      struct drm_sched_job, list);
> > +
> > +       if (!bad) {
> > +               spin_unlock(&sched->job_list_lock);
> > +               return false;
> > +       }
> > +
> > +       spin_unlock(&sched->job_list_lock);
> > +
> > +       if (sched->timeout_wq == system_wq) {
> > +               /*
> > +                * If driver has no specific requirements about serializing
> > +                * reset wrt. other engines, just call timedout_job() directly
> > +                */
> > +               sched->ops->timedout_job(bad);
> > +       } else {
> > +               /*
> > +                * Otherwise queue it on timeout_wq and wait for it to complete
> > +                */
> > +               ... more typing needed here ...
> > +       }
> > +
> > +       if (sched->free_guilty) {
> > +               sched->ops->free_job(bad);
> > +               sched->free_guilty = false;
> > +       }
> > +}
> > +
> >  /**
> >   * drm_sched_main - main scheduler thread
> >   *
> > @@ -787,6 +826,7 @@ static int drm_sched_main(void *param)
> >
> >                 wait_event_interruptible(sched->wake_up_worker,
> >                                          (cleanup_job = drm_sched_get_cleanup_job(sched)) ||
> > +                                        handle_timeout(sched) ||
> >                                          (!drm_sched_blocked(sched) &&
> >                                           (entity = drm_sched_select_entity(sched))) ||
> >                                          kthread_should_stop());
> > ---------------------
> >
> > drm_sched_fault() and the sw timeout handler would just set
> > sched->has_timeout and kick sched->wake_up_worker.
> >
> > And since we handle the timeout case after
> > drm_sched_get_cleanup_job(), we know that all of the successfully
> > completed jobs have already been popped off the list, and won't be
> > unfairly maligned.
> >
> > BR,
> > -R
> >
> > On Tue, Aug 31, 2021 at 6:29 PM Liu, Monk <Monk.Liu@xxxxxxx> wrote:
> > >
> > > [AMD Official Use Only]
> > >
> > > Okay, I will reprepare this patch
> > >
> > > Thanks
> > >
> > > ------------------------------------------
> > > Monk Liu | Cloud-GPU Core team
> > > ------------------------------------------
> > >
> > > -----Original Message-----
> > > From: Daniel Vetter <daniel@xxxxxxxx>
> > > Sent: Tuesday, August 31, 2021 9:02 PM
> > > To: Liu, Monk <Monk.Liu@xxxxxxx>
> > > Cc: amd-gfx@xxxxxxxxxxxxxxxxxxxxx; dri-devel@xxxxxxxxxxxxxxxxxxxxx; Chen, Jingwen <Jingwen.Chen@xxxxxxx>
> > > Subject: Re: [PATCH 2/2] drm/sched: serialize job_timeout and scheduler
> > >
> > > On Tue, Aug 31, 2021 at 02:59:02PM +0200, Daniel Vetter wrote:
> > > > Can we please have some actual commit message here, with detailed
> > > > explanation of the race/bug/whatever, how you fix it and why this is
> > > > the best option?
> > > >
> > > > On Tue, Aug 31, 2021 at 06:35:39PM +0800, Monk Liu wrote:
> > > > > tested-by: jingwen chen <jingwen.chen@xxxxxxx>
> > > > > Signed-off-by: Monk Liu <Monk.Liu@xxxxxxx>
> > > > > Signed-off-by: jingwen chen <jingwen.chen@xxxxxxx>
> > > > > ---
> > > > > >  drivers/gpu/drm/scheduler/sched_main.c | 24 ++++--------------------
> > > > >  1 file changed, 4 insertions(+), 20 deletions(-)
> > > > >
> > > > > > diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
> > > > > index ecf8140..894fdb24 100644
> > > > > --- a/drivers/gpu/drm/scheduler/sched_main.c
> > > > > +++ b/drivers/gpu/drm/scheduler/sched_main.c
> > > > > @@ -319,19 +319,17 @@ static void drm_sched_job_timedout(struct work_struct *work)
> > > > > >     sched = container_of(work, struct drm_gpu_scheduler, work_tdr.work);
> > > > >
> > > > > >     /* Protects against concurrent deletion in drm_sched_get_cleanup_job */
> > > > > +   if (!__kthread_should_park(sched->thread))
> > > >
> > > > This is a __ function, i.e. considered internal, and it's lockless
> > > > atomic, i.e. unordered. And you're not explaining why this works.
> > > >
> > > > > Iow it's probably buggy, and just unconditionally parking the
> > > > kthread is probably the right thing to do. If it's not the right thing
> > > > to do, there's a bug here for sure.
> > >
> > > > Also why don't we reuse the function drivers already have to stop a scheduler thread? We seem to have two kthread_park calls now, that's probably one too many.
> > > -Daniel
> > >
> > > > > +           kthread_park(sched->thread);
> > > > > +
> > > > >     spin_lock(&sched->job_list_lock);
> > > > >     job = list_first_entry_or_null(&sched->pending_list,
> > > > >                                    struct drm_sched_job, list);
> > > > >
> > > > >     if (job) {
> > > > > -           /*
> > > > > -            * Remove the bad job so it cannot be freed by concurrent
> > > > > -            * drm_sched_cleanup_jobs. It will be reinserted back after sched->thread
> > > > > -            * is parked at which point it's safe.
> > > > > -            */
> > > > > -           list_del_init(&job->list);
> > > > >             spin_unlock(&sched->job_list_lock);
> > > > >
> > > > > +           /* vendor's timeout_job should call drm_sched_start() */
> > > > >             status = job->sched->ops->timedout_job(job);
> > > > >
> > > > >             /*
> > > > > @@ -393,20 +391,6 @@ void drm_sched_stop(struct drm_gpu_scheduler *sched, struct drm_sched_job *bad)
> > > > >     kthread_park(sched->thread);
> > > > >
> > > > >     /*
> > > > > -    * Reinsert back the bad job here - now it's safe as
> > > > > -    * drm_sched_get_cleanup_job cannot race against us and release the
> > > > > -    * bad job at this point - we parked (waited for) any in progress
> > > > > -    * (earlier) cleanups and drm_sched_get_cleanup_job will not be called
> > > > > -    * now until the scheduler thread is unparked.
> > > > > -    */
> > > > > -   if (bad && bad->sched == sched)
> > > > > -           /*
> > > > > -            * Add at the head of the queue to reflect it was the earliest
> > > > > -            * job extracted.
> > > > > -            */
> > > > > -           list_add(&bad->list, &sched->pending_list);
> > > > > -
> > > > > -   /*
> > > > >      * Iterate the job list from later to  earlier one and either deactive
> > > > >      * their HW callbacks or remove them from pending list if they already
> > > > >      * signaled.
> > > > > --
> > > > > 2.7.4
> > > > >
> > > >
> > > > --
> > > > Daniel Vetter
> > > > Software Engineer, Intel Corporation
> > > > > http://blog.ffwll.ch
> > >
> > > --
> > > Daniel Vetter
> > > Software Engineer, Intel Corporation
> > > > http://blog.ffwll.ch
>
> --
> Daniel Vetter
> Software Engineer, Intel Corporation
> http://blog.ffwll.ch