On 2023-11-21 04:00, Bert Karwatzki wrote:
> Since linux-next-20231115 my linux system (debian sid on an msi alpha 15 laptop)
> suffers from random deadlocks which can occur after 30-180 min of usage. These
> deadlocks can be actively provoked by creating high system load (usually by
> compiling a kernel with make -j NRCPUS) and then opening instances of
> libreoffice --writer until the system GUI locks up (the mouse cursor can still
> be moved but the screen is frozen). In this state, ssh'ing into the machine is
> still possible, and at least sometimes log messages about hung tasks appear in
> /var/log/kern.log.
>
> More info can be found here:
> https://gitlab.freedesktop.org/drm/amd/-/issues/2994
>
> Using the method described above to trigger the bug, I bisected the problem in
> the linux-next and drm-misc trees to commit f3123c2590005 as the culprit.
> As this simple patch fixes the problem:
>
> diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
> index 044a8c4875ba..25b97db1b623 100644
> --- a/drivers/gpu/drm/scheduler/sched_main.c
> +++ b/drivers/gpu/drm/scheduler/sched_main.c
> @@ -1029,9 +1029,8 @@ EXPORT_SYMBOL(drm_sched_job_cleanup);
>  void drm_sched_wakeup(struct drm_gpu_scheduler *sched,
>  		      struct drm_sched_entity *entity)
>  {
> -	if (drm_sched_entity_is_ready(entity))
> -		if (drm_sched_can_queue(sched, entity))
> -			drm_sched_run_job_queue(sched);
> +	if (drm_sched_can_queue(sched, entity))
> +		drm_sched_run_job_queue(sched);
>  }
>
>  /**
>
> the error might be in the entity->dependency branch of drm_sched_entity_is_ready
> (some kind of circular dependency ...).
>
> To see if the change to drm_sched_wakeup is the actual cause of the problem, or
> if the problem was caused by the redesign of the drm scheduler in
> linux-next-20231115+, I created the following patch for linux-6.6.0:
>
> diff --git a/drivers/gpu/drm/scheduler/sched_entity.c b/drivers/gpu/drm/scheduler/sched_entity.c
> index a42763e1429d..dc2abd299aeb 100644
> --- a/drivers/gpu/drm/scheduler/sched_entity.c
> +++ b/drivers/gpu/drm/scheduler/sched_entity.c
> @@ -358,7 +358,7 @@ static void drm_sched_entity_wakeup(struct dma_fence *f,
>  		container_of(cb, struct drm_sched_entity, cb);
>
>  	drm_sched_entity_clear_dep(f, cb);
> -	drm_sched_wakeup_if_can_queue(entity->rq->sched);
> +	drm_sched_wakeup_if_can_queue(entity->rq->sched, entity);
>  }
>
>  /**
> @@ -590,7 +590,7 @@ void drm_sched_entity_push_job(struct drm_sched_job *sched_job)
>  		if (drm_sched_policy == DRM_SCHED_POLICY_FIFO)
>  			drm_sched_rq_update_fifo(entity, submit_ts);
>
> -		drm_sched_wakeup_if_can_queue(entity->rq->sched);
> +		drm_sched_wakeup_if_can_queue(entity->rq->sched, entity);
>  	}
>  }
>  EXPORT_SYMBOL(drm_sched_entity_push_job);
> diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
> index 5a3a622fc672..bbe06403b33d 100644
> --- a/drivers/gpu/drm/scheduler/sched_main.c
> +++ b/drivers/gpu/drm/scheduler/sched_main.c
> @@ -865,10 +865,11 @@ static bool drm_sched_can_queue(struct drm_gpu_scheduler *sched)
>   *
>   * Wake up the scheduler if we can queue jobs.
>   */
> -void drm_sched_wakeup_if_can_queue(struct drm_gpu_scheduler *sched)
> +void drm_sched_wakeup_if_can_queue(struct drm_gpu_scheduler *sched, struct drm_sched_entity *entity)
>  {
> -	if (drm_sched_can_queue(sched))
> -		wake_up_interruptible(&sched->wake_up_worker);
> +	if (drm_sched_entity_is_ready(entity))
> +		if (drm_sched_can_queue(sched))
> +			wake_up_interruptible(&sched->wake_up_worker);
>  }
>
>  /**
> diff --git a/include/drm/gpu_scheduler.h b/include/drm/gpu_scheduler.h
> index ac65f0626cfc..6cfe3d193e69 100644
> --- a/include/drm/gpu_scheduler.h
> +++ b/include/drm/gpu_scheduler.h
> @@ -548,7 +548,7 @@ void drm_sched_entity_modify_sched(struct drm_sched_entity *entity,
>  				   unsigned int num_sched_list);
>
>  void drm_sched_job_cleanup(struct drm_sched_job *job);
> -void drm_sched_wakeup_if_can_queue(struct drm_gpu_scheduler *sched);
> +void drm_sched_wakeup_if_can_queue(struct drm_gpu_scheduler *sched, struct drm_sched_entity *entity);
>  void drm_sched_stop(struct drm_gpu_scheduler *sched, struct drm_sched_job *bad);
>  void drm_sched_start(struct drm_gpu_scheduler *sched, bool full_recovery);
>  void drm_sched_resubmit_jobs(struct drm_gpu_scheduler *sched);
>
> This brings the extra check to the old scheduler, and it has so far not caused
> any trouble (using the same stress test described above), so chances are that
> the error is somewhere else in the redesigned scheduler.
>
> Bert Karwatzki

Hi Bert,

Thanks for looking into this.

As an afterthought, removing the "entity_is_ready()" qualifier in the wake-up
makes the scheduling more opportunistic, and I agree that that is the more
correct approach. Commit f3123c2590005 basically made the code behave as
closely as possible to the way it functioned before the move to run-queues.

Would you like to create a patch for this?

-- 
Regards,
Luben