Re: [PATCH 2/2] drm/sched: serialize job_timeout and scheduler

Rob Clark <robdclark@xxxxxxxxx> · Wed, 10 Nov 2021 11:15:26 -0800

On Wed, Nov 10, 2021 at 1:50 AM Daniel Vetter <daniel@xxxxxxxx> wrote:
>
> On Tue, Nov 09, 2021 at 08:17:01AM -0800, Rob Clark wrote:
> > On Tue, Nov 9, 2021 at 1:07 AM Daniel Vetter <daniel@xxxxxxxx> wrote:
> > >
> > > On Mon, Nov 08, 2021 at 03:39:17PM -0800, Rob Clark wrote:
> > > > I stumbled across this thread when I ran into the same issue, while
> > > > working out how to move drm/msm to use scheduler's retire +
> > > > timeout/recovery (and get rid of our own mirror list of in-flight
> > > > jobs).  We already have hw error detection enabled, and it can signal
> > > > quite fast, so assuming the first job on the list is the guilty job
> > > > just won't work.
> > > >
> > > > But I was considering a slightly different approach to fixing this,
> > > > instead just handling it all in drm_sched_main() and getting rid of
> > > > the complicated kthread parking gymnastics.  Ie. something along the
> > > > lines of:
> > >
> > > So handling timeouts in the main sched thread wont work as soon as you
> > > have multiple engines and reset that impacts across engines:
> > >
> > > - Nothing is simplified since you still need to stop the other scheduler
> > >   threads.
> > >
> > > - You get deadlocks if 2 schedulers time out at the same time, and both
> > >   want to stop the other one.
> > >
> > > Hence workqueue. Now the rule for the wq is that you can only have one per
> > > reset domain, so
> > > - single engine you just take the one drm/sched provides
> > > - if reset affects all your engines in the chip, then you allocate on in
> > >   the drm_device and pass that to all
> > > - if you have a complex of gpus all interconnected (e.g. xgmi hive for
> > >   amd), then it's one wq for the entire hive
> > >
> > > _All_ reset related things must be run on that workqueue or things breaks,
> > > which means if you get hw fault that also needs to be run there. I guess
> > > we should either patch drm/sched to check you call that function from the
> > > right workqueue, or just handle it internally.
> >
> > Hmm, ok.. I guess it would be useful to better document the reasoning
> > for the current design, that would have steered me more towards the
> > approach taken in this patch.
>
> Maybe this was because you worked on an old kernel? Boris did update the
> kerneldoc as part of making gpu reset work for panfrost, which has this
> multi-engine reset problem. If that's not yet clear then we need to
> improve the docs further.

I saw that, and understood the ordered wq.. but missed the implication
regarding having to park other scheduler kthreads

BR,
-R

> AMD's problem is even worse, because their reset domain is the entire xgmi
> hive, so multiple pci devices.
>
> Also there might more issues in drm/sched ofc, e.g. I've looked a bit at
> ordering/barriers and I'm pretty sure a lot are still missing. Or at least
> we should have comments in the code explaining why it all works.
> -Daniel
>
> >
> > BR,
> > -R
> >
> > > -Daniel
> > >
> > > >
> > > > ---------------------
> > > > diff --git a/drivers/gpu/drm/scheduler/sched_main.c
> > > > b/drivers/gpu/drm/scheduler/sched_main.c
> > > > index 67382621b429..4d6ce775c316 100644
> > > > --- a/drivers/gpu/drm/scheduler/sched_main.c
> > > > +++ b/drivers/gpu/drm/scheduler/sched_main.c
> > > > @@ -764,6 +764,45 @@ static bool drm_sched_blocked(struct
> > > > drm_gpu_scheduler *sched)
> > > >         return false;
> > > >  }
> > > >
> > > > +static bool handle_timeout(struct drm_gpu_scheduler *sched)
> > > > +{
> > > > +       struct drm_sched_job *bad;
> > > > +
> > > > +       if (!sched->has_timeout)
> > > > +               return false;
> > > > +
> > > > +       sched->has_timeout = false;
> > > > +
> > > > +       spin_lock(&sched->job_list_lock);
> > > > +       bad = list_first_entry_or_null(&sched->pending_list,
> > > > +                                      struct drm_sched_job, list);
> > > > +
> > > > +       if (!bad) {
> > > > +               spin_unlock(&sched->job_list_lock);
> > > > +               return false;
> > > > +       }
> > > > +
> > > > +       spin_unlock(&sched->job_list_lock);
> > > > +
> > > > +       if (sched->timeout_wq == system_wq) {
> > > > +               /*
> > > > +                * If driver has no specific requirements about serializing
> > > > +                * reset wrt. other engines, just call timedout_job() directly
> > > > +                */
> > > > +               sched->ops->timedout_job(job);
> > > > +       } else {
> > > > +               /*
> > > > +                * Otherwise queue it on timeout_wq and wait for it to complete
> > > > +                */
> > > > +               ... more typing needed here ...
> > > > +       }
> > > > +
> > > > +       if (sched->free_guilty) {
> > > > +               sched->ops->free_job(job);
> > > > +               sched->free_guilty = false;
> > > > +       }
> > > > +}
> > > > +
> > > >  /**
> > > >   * drm_sched_main - main scheduler thread
> > > >   *
> > > > @@ -787,6 +826,7 @@ static int drm_sched_main(void *param)
> > > >
> > > >                 wait_event_interruptible(sched->wake_up_worker,
> > > >                                          (cleanup_job =
> > > > drm_sched_get_cleanup_job(sched)) ||
> > > > +                                        handle_timeout(sched) ||
> > > >                                          (!drm_sched_blocked(sched) &&
> > > >                                           (entity =
> > > > drm_sched_select_entity(sched))) ||
> > > >                                          kthread_should_stop());
> > > > ---------------------
> > > >
> > > > drm_sched_fault() and the sw timeout handler would just set
> > > > sched->has_timeout and kick sched->wake_up_worker.
> > > >
> > > > And since we handle the timeout case after
> > > > drm_sched_get_cleanup_job(), we know that all of the successfully
> > > > completed jobs have already been popped off the list, and won't be
> > > > unfairly maligned.
> > > >
> > > > BR,
> > > > -R
> > > >
> > > > On Tue, Aug 31, 2021 at 6:29 PM Liu, Monk <Monk.Liu@xxxxxxx> wrote:
> > > > >
> > > > > [AMD Official Use Only]
> > > > >
> > > > > Okay, I will reprepare this patch
> > > > >
> > > > > Thanks
> > > > >
> > > > > ------------------------------------------
> > > > > Monk Liu | Cloud-GPU Core team
> > > > > ------------------------------------------
> > > > >
> > > > > -----Original Message-----
> > > > > From: Daniel Vetter <daniel@xxxxxxxx>
> > > > > Sent: Tuesday, August 31, 2021 9:02 PM
> > > > > To: Liu, Monk <Monk.Liu@xxxxxxx>
> > > > > Cc: amd-gfx@xxxxxxxxxxxxxxxxxxxxx; dri-devel@xxxxxxxxxxxxxxxxxxxxx; Chen, Jingwen <Jingwen.Chen@xxxxxxx>
> > > > > Subject: Re: [PATCH 2/2] drm/sched: serialize job_timeout and scheduler
> > > > >
> > > > > On Tue, Aug 31, 2021 at 02:59:02PM +0200, Daniel Vetter wrote:
> > > > > > Can we please have some actual commit message here, with detailed
> > > > > > explanation of the race/bug/whatever, how you fix it and why this is
> > > > > > the best option?
> > > > > >
> > > > > > On Tue, Aug 31, 2021 at 06:35:39PM +0800, Monk Liu wrote:
> > > > > > > tested-by: jingwen chen <jingwen.chen@xxxxxxx>
> > > > > > > Signed-off-by: Monk Liu <Monk.Liu@xxxxxxx>
> > > > > > > Signed-off-by: jingwen chen <jingwen.chen@xxxxxxx>
> > > > > > > ---
> > > > > > >  drivers/gpu/drm/scheduler/sched_main.c | 24
> > > > > > > ++++--------------------
> > > > > > >  1 file changed, 4 insertions(+), 20 deletions(-)
> > > > > > >
> > > > > > > diff --git a/drivers/gpu/drm/scheduler/sched_main.c
> > > > > > > b/drivers/gpu/drm/scheduler/sched_main.c
> > > > > > > index ecf8140..894fdb24 100644
> > > > > > > --- a/drivers/gpu/drm/scheduler/sched_main.c
> > > > > > > +++ b/drivers/gpu/drm/scheduler/sched_main.c
> > > > > > > @@ -319,19 +319,17 @@ static void drm_sched_job_timedout(struct work_struct *work)
> > > > > > >     sched = container_of(work, struct drm_gpu_scheduler,
> > > > > > > work_tdr.work);
> > > > > > >
> > > > > > >     /* Protects against concurrent deletion in
> > > > > > > drm_sched_get_cleanup_job */
> > > > > > > +   if (!__kthread_should_park(sched->thread))
> > > > > >
> > > > > > This is a __ function, i.e. considered internal, and it's lockless
> > > > > > atomic, i.e. unordered. And you're not explaining why this works.
> > > > > >
> > > > > > Iow it's probably buggy, and an just unconditionally parking the
> > > > > > kthread is probably the right thing to do. If it's not the right thing
> > > > > > to do, there's a bug here for sure.
> > > > >
> > > > > Also why don't we reuse the function drivers already have to stop a scheduler thread? We seem to have two kthread_park now, that's probably one too much.
> > > > > -Daniel
> > > > >
> > > > > > > +           kthread_park(sched->thread);
> > > > > > > +
> > > > > > >     spin_lock(&sched->job_list_lock);
> > > > > > >     job = list_first_entry_or_null(&sched->pending_list,
> > > > > > >                                    struct drm_sched_job, list);
> > > > > > >
> > > > > > >     if (job) {
> > > > > > > -           /*
> > > > > > > -            * Remove the bad job so it cannot be freed by concurrent
> > > > > > > -            * drm_sched_cleanup_jobs. It will be reinserted back after sched->thread
> > > > > > > -            * is parked at which point it's safe.
> > > > > > > -            */
> > > > > > > -           list_del_init(&job->list);
> > > > > > >             spin_unlock(&sched->job_list_lock);
> > > > > > >
> > > > > > > +           /* vendor's timeout_job should call drm_sched_start() */
> > > > > > >             status = job->sched->ops->timedout_job(job);
> > > > > > >
> > > > > > >             /*
> > > > > > > @@ -393,20 +391,6 @@ void drm_sched_stop(struct drm_gpu_scheduler *sched, struct drm_sched_job *bad)
> > > > > > >     kthread_park(sched->thread);
> > > > > > >
> > > > > > >     /*
> > > > > > > -    * Reinsert back the bad job here - now it's safe as
> > > > > > > -    * drm_sched_get_cleanup_job cannot race against us and release the
> > > > > > > -    * bad job at this point - we parked (waited for) any in progress
> > > > > > > -    * (earlier) cleanups and drm_sched_get_cleanup_job will not be called
> > > > > > > -    * now until the scheduler thread is unparked.
> > > > > > > -    */
> > > > > > > -   if (bad && bad->sched == sched)
> > > > > > > -           /*
> > > > > > > -            * Add at the head of the queue to reflect it was the earliest
> > > > > > > -            * job extracted.
> > > > > > > -            */
> > > > > > > -           list_add(&bad->list, &sched->pending_list);
> > > > > > > -
> > > > > > > -   /*
> > > > > > >      * Iterate the job list from later to  earlier one and either deactive
> > > > > > >      * their HW callbacks or remove them from pending list if they already
> > > > > > >      * signaled.
> > > > > > > --
> > > > > > > 2.7.4
> > > > > > >
> > > > > >
> > > > > > --
> > > > > > Daniel Vetter
> > > > > > Software Engineer, Intel Corporation
> > > > > > https://nam11.safelinks.protection.outlook.com/?url=http%3A%2F%2Fblog.
> > > > > > ffwll.ch%2F&amp;data=04%7C01%7CMonk.Liu%40amd.com%7C298815bea18f4fbf76
> > > > > > b308d96c7f7a8b%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C6376601170
> > > > > > 51194614%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiL
> > > > > > CJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&amp;sdata=QzgCU7%2BPdA0aWL5%2BJLg
> > > > > > KeKbGaMMGqeGI9KE0P0LXlN4%3D&amp;reserved=0
> > > > >
> > > > > --
> > > > > Daniel Vetter
> > > > > Software Engineer, Intel Corporation
> > > > > https://nam11.safelinks.protection.outlook.com/?url=http%3A%2F%2Fblog.ffwll.ch%2F&amp;data=04%7C01%7CMonk.Liu%40amd.com%7C298815bea18f4fbf76b308d96c7f7a8b%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637660117051194614%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&amp;sdata=QzgCU7%2BPdA0aWL5%2BJLgKeKbGaMMGqeGI9KE0P0LXlN4%3D&amp;reserved=0
> > >
> > > --
> > > Daniel Vetter
> > > Software Engineer, Intel Corporation
> > > http://blog.ffwll.ch
>
> --
> Daniel Vetter
> Software Engineer, Intel Corporation
> http://blog.ffwll.ch