On Wednesday, 2023-11-22 at 18:02 -0500, Luben Tuikov wrote:
> On 2023-11-21 04:00, Bert Karwatzki wrote:
> > Since linux-next-20231115 my Linux system (Debian sid on an MSI Alpha
> > 15 laptop) has suffered from random deadlocks, which can occur after
> > 30-180 minutes of usage. The deadlocks can be actively provoked by
> > creating high system load (usually by compiling a kernel with
> > make -j NRCPUS) and then opening instances of libreoffice --writer
> > until the GUI locks up (the mouse cursor can still be moved, but the
> > screen is frozen). In this state, ssh'ing into the machine is still
> > possible, and at least sometimes messages about hung tasks appear in
> > /var/log/kern.log.
> >
> > More info can be found here:
> > https://gitlab.freedesktop.org/drm/amd/-/issues/2994
> >
> > Using this method to trigger the bug, I bisected the problem in the
> > linux-next and drm-misc trees to commit f3123c2590005. As this simple
> > patch fixes the problem
> >
> > diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
> > index 044a8c4875ba..25b97db1b623 100644
> > --- a/drivers/gpu/drm/scheduler/sched_main.c
> > +++ b/drivers/gpu/drm/scheduler/sched_main.c
> > @@ -1029,9 +1029,8 @@ EXPORT_SYMBOL(drm_sched_job_cleanup);
> >  void drm_sched_wakeup(struct drm_gpu_scheduler *sched,
> >  		      struct drm_sched_entity *entity)
> >  {
> > -	if (drm_sched_entity_is_ready(entity))
> > -		if (drm_sched_can_queue(sched, entity))
> > -			drm_sched_run_job_queue(sched);
> > +	if (drm_sched_can_queue(sched, entity))
> > +		drm_sched_run_job_queue(sched);
> >  }
> >
> >  /**
> >
> > the issue might lie in the entity->dependency branch of
> > drm_sched_entity_is_ready() (some kind of circular dependency ...).
> >
> > To see whether the change to drm_sched_wakeup() is the actual cause
> > of the problem, or whether the problem was caused by the redesign of
> > the DRM scheduler in linux-next-20231115+, I created the following
> > patch for linux-6.6.0:
> >
> > diff --git a/drivers/gpu/drm/scheduler/sched_entity.c b/drivers/gpu/drm/scheduler/sched_entity.c
> > index a42763e1429d..dc2abd299aeb 100644
> > --- a/drivers/gpu/drm/scheduler/sched_entity.c
> > +++ b/drivers/gpu/drm/scheduler/sched_entity.c
> > @@ -358,7 +358,7 @@ static void drm_sched_entity_wakeup(struct dma_fence *f,
> >  		container_of(cb, struct drm_sched_entity, cb);
> >
> >  	drm_sched_entity_clear_dep(f, cb);
> > -	drm_sched_wakeup_if_can_queue(entity->rq->sched);
> > +	drm_sched_wakeup_if_can_queue(entity->rq->sched, entity);
> >  }
> >
> >  /**
> > @@ -590,7 +590,7 @@ void drm_sched_entity_push_job(struct drm_sched_job *sched_job)
> >  		if (drm_sched_policy == DRM_SCHED_POLICY_FIFO)
> >  			drm_sched_rq_update_fifo(entity, submit_ts);
> >
> > -		drm_sched_wakeup_if_can_queue(entity->rq->sched);
> > +		drm_sched_wakeup_if_can_queue(entity->rq->sched, entity);
> >  	}
> >  }
> >  EXPORT_SYMBOL(drm_sched_entity_push_job);
> > diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
> > index 5a3a622fc672..bbe06403b33d 100644
> > --- a/drivers/gpu/drm/scheduler/sched_main.c
> > +++ b/drivers/gpu/drm/scheduler/sched_main.c
> > @@ -865,10 +865,11 @@ static bool drm_sched_can_queue(struct drm_gpu_scheduler *sched)
> >   *
> >   * Wake up the scheduler if we can queue jobs.
> >   */
> > -void drm_sched_wakeup_if_can_queue(struct drm_gpu_scheduler *sched)
> > +void drm_sched_wakeup_if_can_queue(struct drm_gpu_scheduler *sched, struct drm_sched_entity *entity)
> >  {
> > -	if (drm_sched_can_queue(sched))
> > -		wake_up_interruptible(&sched->wake_up_worker);
> > +	if (drm_sched_entity_is_ready(entity))
> > +		if (drm_sched_can_queue(sched))
> > +			wake_up_interruptible(&sched->wake_up_worker);
> >  }
> >
> >  /**
> > diff --git a/include/drm/gpu_scheduler.h b/include/drm/gpu_scheduler.h
> > index ac65f0626cfc..6cfe3d193e69 100644
> > --- a/include/drm/gpu_scheduler.h
> > +++ b/include/drm/gpu_scheduler.h
> > @@ -548,7 +548,7 @@ void drm_sched_entity_modify_sched(struct drm_sched_entity *entity,
> >  				   unsigned int num_sched_list);
> >
> >  void drm_sched_job_cleanup(struct drm_sched_job *job);
> > -void drm_sched_wakeup_if_can_queue(struct drm_gpu_scheduler *sched);
> > +void drm_sched_wakeup_if_can_queue(struct drm_gpu_scheduler *sched, struct drm_sched_entity *entity);
> >  void drm_sched_stop(struct drm_gpu_scheduler *sched, struct drm_sched_job *bad);
> >  void drm_sched_start(struct drm_gpu_scheduler *sched, bool full_recovery);
> >  void drm_sched_resubmit_jobs(struct drm_gpu_scheduler *sched);
> >
> > This adds the extra check to the old scheduler and has so far not
> > caused any trouble (under the same stress test described above), so
> > chances are that the error is somewhere else in the redesigned
> > scheduler.
> >
> > Bert Karwatzki
>
> Hi Bert,
>
> Thanks for looking into this.
>
> As an afterthought, removing the "entity_is_ready()" qualifier in the
> wake-up makes the scheduling more opportunistic, and I agree that that
> is the more correct approach. Commit f3123c2590005 basically made the
> code function as close as possible to the way it did before the move
> to run-queues.
>
> Would you like to create a patch for this?

Should I create the patch for the drm-misc tree (is a simple revert of
f3123c.. plus an explanatory commit message enough)?
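For reference, the check that f3123c2590005 dropped from the wake-up
path looks roughly like this (paraphrased from
drivers/gpu/drm/scheduler/sched_entity.c; the exact code may differ
slightly between trees):

	/*
	 * Sketch of drm_sched_entity_is_ready(): an entity counts as
	 * ready only if it has a job queued and no unresolved dependency
	 * fence. The entity->dependency test is the branch suspected
	 * above: if a dependency fence never signals, or its callback
	 * never triggers another wake-up, a wake-up gated on this check
	 * could be lost.
	 */
	static bool drm_sched_entity_is_ready(struct drm_sched_entity *entity)
	{
		if (spsc_queue_peek(&entity->job_queue) == NULL)
			return false;

		if (READ_ONCE(entity->dependency))
			return false;

		return true;
	}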
Also, I did a little more experimentation: I applied f3123c25900 on top
of commit a6149f0393699 and on top of commit 35963cf2cd25e to see where
the problems start. The result: f3.. + a6.. leads to lockups, while
f3.. + 35.. seems to be fine, so the problems start with the move from
kthread to workqueue.

Bert Karwatzki