On Wednesday, 2023-11-22 at 18:02 -0500, Luben Tuikov wrote:
> On 2023-11-21 04:00, Bert Karwatzki wrote:
> > Since linux-next-20231115 my Linux system (Debian sid on an MSI Alpha
> > 15 laptop) has suffered from random deadlocks, which can occur after
> > 30-180 minutes of usage. The deadlocks can be actively provoked by
> > creating high system load (usually by compiling a kernel with
> > make -j NRCPUS) and then opening instances of libreoffice --writer
> > until the GUI locks up (the mouse cursor can still be moved, but the
> > screen is frozen). In this state, ssh'ing into the machine is still
> > possible, and at least sometimes messages about hung tasks appear in
> > /var/log/kern.log.
> >
> > More info can be found here:
> > https://gitlab.freedesktop.org/drm/amd/-/issues/2994
> >
> > Using this method to trigger the bug, I bisected the problem in the
> > linux-next and drm-misc trees to commit f3123c2590005. As this simple
> > patch fixes the problem
> >
> > diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
> > index 044a8c4875ba..25b97db1b623 100644
> > --- a/drivers/gpu/drm/scheduler/sched_main.c
> > +++ b/drivers/gpu/drm/scheduler/sched_main.c
> > @@ -1029,9 +1029,8 @@ EXPORT_SYMBOL(drm_sched_job_cleanup);
> >  void drm_sched_wakeup(struct drm_gpu_scheduler *sched,
> >  		      struct drm_sched_entity *entity)
> >  {
> > -	if (drm_sched_entity_is_ready(entity))
> > -		if (drm_sched_can_queue(sched, entity))
> > -			drm_sched_run_job_queue(sched);
> > +	if (drm_sched_can_queue(sched, entity))
> > +		drm_sched_run_job_queue(sched);
> >  }
> >
> >  /**
> >
> > the issue might lie in the entity->dependency branch of
> > drm_sched_entity_is_ready() (some kind of circular dependency ...).
> >
> > To see whether the change to drm_sched_wakeup() is the actual cause
> > of the problem, or whether the problem was caused by the redesign of
> > the DRM scheduler in linux-next-20231115+, I created the following
> > patch for linux-6.6.0:
> >
> > diff --git a/drivers/gpu/drm/scheduler/sched_entity.c b/drivers/gpu/drm/scheduler/sched_entity.c
> > index a42763e1429d..dc2abd299aeb 100644
> > --- a/drivers/gpu/drm/scheduler/sched_entity.c
> > +++ b/drivers/gpu/drm/scheduler/sched_entity.c
> > @@ -358,7 +358,7 @@ static void drm_sched_entity_wakeup(struct dma_fence *f,
> >  		container_of(cb, struct drm_sched_entity, cb);
> >
> >  	drm_sched_entity_clear_dep(f, cb);
> > -	drm_sched_wakeup_if_can_queue(entity->rq->sched);
> > +	drm_sched_wakeup_if_can_queue(entity->rq->sched, entity);
> >  }
> >
> >  /**
> > @@ -590,7 +590,7 @@ void drm_sched_entity_push_job(struct drm_sched_job *sched_job)
> >  		if (drm_sched_policy == DRM_SCHED_POLICY_FIFO)
> >  			drm_sched_rq_update_fifo(entity, submit_ts);
> >
> > -		drm_sched_wakeup_if_can_queue(entity->rq->sched);
> > +		drm_sched_wakeup_if_can_queue(entity->rq->sched, entity);
> >  	}
> >  }
> >  EXPORT_SYMBOL(drm_sched_entity_push_job);
> > diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
> > index 5a3a622fc672..bbe06403b33d 100644
> > --- a/drivers/gpu/drm/scheduler/sched_main.c
> > +++ b/drivers/gpu/drm/scheduler/sched_main.c
> > @@ -865,10 +865,11 @@ static bool drm_sched_can_queue(struct drm_gpu_scheduler *sched)
> >   *
> >   * Wake up the scheduler if we can queue jobs.
> >   */
> > -void drm_sched_wakeup_if_can_queue(struct drm_gpu_scheduler *sched)
> > +void drm_sched_wakeup_if_can_queue(struct drm_gpu_scheduler *sched, struct drm_sched_entity *entity)
> >  {
> > -	if (drm_sched_can_queue(sched))
> > -		wake_up_interruptible(&sched->wake_up_worker);
> > +	if (drm_sched_entity_is_ready(entity))
> > +		if (drm_sched_can_queue(sched))
> > +			wake_up_interruptible(&sched->wake_up_worker);
> >  }
> >
> >  /**
> > diff --git a/include/drm/gpu_scheduler.h b/include/drm/gpu_scheduler.h
> > index ac65f0626cfc..6cfe3d193e69 100644
> > --- a/include/drm/gpu_scheduler.h
> > +++ b/include/drm/gpu_scheduler.h
> > @@ -548,7 +548,7 @@ void drm_sched_entity_modify_sched(struct drm_sched_entity *entity,
> >  				   unsigned int num_sched_list);
> >
> >  void drm_sched_job_cleanup(struct drm_sched_job *job);
> > -void drm_sched_wakeup_if_can_queue(struct drm_gpu_scheduler *sched);
> > +void drm_sched_wakeup_if_can_queue(struct drm_gpu_scheduler *sched, struct drm_sched_entity *entity);
> >  void drm_sched_stop(struct drm_gpu_scheduler *sched, struct drm_sched_job *bad);
> >  void drm_sched_start(struct drm_gpu_scheduler *sched, bool full_recovery);
> >  void drm_sched_resubmit_jobs(struct drm_gpu_scheduler *sched);
> >
> > This adds the extra check to the old scheduler and has so far not
> > caused any trouble (under the same stress test described above), so
> > chances are that the error is somewhere else in the redesigned
> > scheduler.
> >
> > Bert Karwatzki
>
> Hi Bert,
>
> Thanks for looking into this.
>
> As an afterthought, removing the "entity_is_ready()" qualifier in the
> wake-up makes the scheduling more opportunistic, and I agree that that
> is the more correct approach. Commit f3123c2590005 basically made the
> code function as close as possible to the way it did before the move
> to run-queues.
>
> Would you like to create a patch for this?

Should I create the patch for the drm-misc tree (is a simple revert of
f3123c.. plus an explanatory commit message enough)?
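For reference, the check that f3123c2590005 dropped from the wake-up
path looks roughly like this (paraphrased from
drivers/gpu/drm/scheduler/sched_entity.c; the exact code may differ
slightly between trees):

	/*
	 * Sketch of drm_sched_entity_is_ready(): an entity counts as
	 * ready only if it has a job queued and no unresolved dependency
	 * fence. The entity->dependency test is the branch suspected
	 * above: if a dependency fence never signals, or its callback
	 * never triggers another wake-up, a wake-up gated on this check
	 * could be lost.
	 */
	static bool drm_sched_entity_is_ready(struct drm_sched_entity *entity)
	{
		if (spsc_queue_peek(&entity->job_queue) == NULL)
			return false;

		if (READ_ONCE(entity->dependency))
			return false;

		return true;
	}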
Also, I did a little more experimentation: I applied f3123c25900 on top
of commit a6149f0393699 and on top of commit 35963cf2cd25e to see where
the problems start. The result: f3.. + a6.. leads to lockups, while
f3.. + 35.. seems to be fine, so the problems start with the move from
kthread to workqueue.

Bert Karwatzki