On 2023-11-21 04:00, Bert Karwatzki wrote:
> Since linux-next-20231115 my linux system (debian sid on an msi alpha 15 laptop)
> suffers from random deadlocks which can occur after 30-180 min of usage. These
> deadlocks can be actively provoked by creating high system load (usually by
> compiling a kernel with make -j NRCPUS) and then opening instances of
> libreoffice --writer until the system GUI locks up (the mouse cursor can still
> be moved but the screen is frozen). In this state, ssh'ing into the machine is
> still possible, and at least sometimes log messages about hung tasks appear in
> /var/log/kern.log.
>
> More info can be found here:
> https://gitlab.freedesktop.org/drm/amd/-/issues/2994
>
> Using the method described above to trigger the bug, I bisected the problem in
> the linux-next and drm-misc trees to commit f3123c2590005 as the culprit.
> As this simple patch fixes the problem:
>
> diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
> index 044a8c4875ba..25b97db1b623 100644
> --- a/drivers/gpu/drm/scheduler/sched_main.c
> +++ b/drivers/gpu/drm/scheduler/sched_main.c
> @@ -1029,9 +1029,8 @@ EXPORT_SYMBOL(drm_sched_job_cleanup);
>  void drm_sched_wakeup(struct drm_gpu_scheduler *sched,
>  		      struct drm_sched_entity *entity)
>  {
> -	if (drm_sched_entity_is_ready(entity))
> -		if (drm_sched_can_queue(sched, entity))
> -			drm_sched_run_job_queue(sched);
> +	if (drm_sched_can_queue(sched, entity))
> +		drm_sched_run_job_queue(sched);
>  }
>
>  /**
>
> the error might be in the entity->dependency branch of drm_sched_entity_is_ready
> (some kind of circular dependency ...).
>
> To see if the change to drm_sched_wakeup is the actual cause of the problem, or
> if the problem was caused by the redesign of the drm scheduler in
> linux-next-20231115+, I created the following patch for linux-6.6.0:
>
> diff --git a/drivers/gpu/drm/scheduler/sched_entity.c b/drivers/gpu/drm/scheduler/sched_entity.c
> index a42763e1429d..dc2abd299aeb 100644
> --- a/drivers/gpu/drm/scheduler/sched_entity.c
> +++ b/drivers/gpu/drm/scheduler/sched_entity.c
> @@ -358,7 +358,7 @@ static void drm_sched_entity_wakeup(struct dma_fence *f,
>  		container_of(cb, struct drm_sched_entity, cb);
>
>  	drm_sched_entity_clear_dep(f, cb);
> -	drm_sched_wakeup_if_can_queue(entity->rq->sched);
> +	drm_sched_wakeup_if_can_queue(entity->rq->sched, entity);
>  }
>
>  /**
> @@ -590,7 +590,7 @@ void drm_sched_entity_push_job(struct drm_sched_job *sched_job)
>  		if (drm_sched_policy == DRM_SCHED_POLICY_FIFO)
>  			drm_sched_rq_update_fifo(entity, submit_ts);
>
> -		drm_sched_wakeup_if_can_queue(entity->rq->sched);
> +		drm_sched_wakeup_if_can_queue(entity->rq->sched, entity);
>  	}
>  }
>  EXPORT_SYMBOL(drm_sched_entity_push_job);
> diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
> index 5a3a622fc672..bbe06403b33d 100644
> --- a/drivers/gpu/drm/scheduler/sched_main.c
> +++ b/drivers/gpu/drm/scheduler/sched_main.c
> @@ -865,10 +865,11 @@ static bool drm_sched_can_queue(struct drm_gpu_scheduler *sched)
>   *
>   * Wake up the scheduler if we can queue jobs.
>   */
> -void drm_sched_wakeup_if_can_queue(struct drm_gpu_scheduler *sched)
> +void drm_sched_wakeup_if_can_queue(struct drm_gpu_scheduler *sched, struct drm_sched_entity *entity)
>  {
> -	if (drm_sched_can_queue(sched))
> -		wake_up_interruptible(&sched->wake_up_worker);
> +	if (drm_sched_entity_is_ready(entity))
> +		if (drm_sched_can_queue(sched))
> +			wake_up_interruptible(&sched->wake_up_worker);
>  }
>
>  /**
> diff --git a/include/drm/gpu_scheduler.h b/include/drm/gpu_scheduler.h
> index ac65f0626cfc..6cfe3d193e69 100644
> --- a/include/drm/gpu_scheduler.h
> +++ b/include/drm/gpu_scheduler.h
> @@ -548,7 +548,7 @@ void drm_sched_entity_modify_sched(struct drm_sched_entity *entity,
>  				   unsigned int num_sched_list);
>
>  void drm_sched_job_cleanup(struct drm_sched_job *job);
> -void drm_sched_wakeup_if_can_queue(struct drm_gpu_scheduler *sched);
> +void drm_sched_wakeup_if_can_queue(struct drm_gpu_scheduler *sched, struct drm_sched_entity *entity);
>  void drm_sched_stop(struct drm_gpu_scheduler *sched, struct drm_sched_job *bad);
>  void drm_sched_start(struct drm_gpu_scheduler *sched, bool full_recovery);
>  void drm_sched_resubmit_jobs(struct drm_gpu_scheduler *sched);
>
> This brings the extra check to the old scheduler, and it has so far not caused
> any trouble (using the same stress test described above), so chances are that
> the error is somewhere else in the redesigned scheduler.
>
> Bert Karwatzki

Hi Bert,

Thanks for looking into this.

As an afterthought, removing the "entity_is_ready()" qualifier in the wake-up
makes the scheduling more opportunistic, and I agree that that is the more
correct approach. Commit f3123c2590005 basically made the code behave as
closely as possible to the way it functioned before the move to run-queues.

Would you like to create a patch for this?

-- 
Regards,
Luben