On Thu, Jan 25, 2024 at 04:12:58PM +0100, Christian König wrote:
> 
> 
> Am 24.01.24 um 22:08 schrieb Matthew Brost:
> > All entities must be drained in the DRM scheduler run job worker to
> > avoid the following case. An entity found that is ready, no job found
> > ready on entity, and run job worker goes idle with other entities + jobs
> > ready. Draining all ready entities (i.e. loop over all ready entities)
> > in the run job worker ensures all job that are ready will be scheduled.
> 
> That doesn't make sense. drm_sched_select_entity() only returns entities
> which are "ready", e.g. have a job to run.
> 

That is what I thought too, hence my original design, but it is not
exactly true. Let me explain.

drm_sched_select_entity() returns an entity with a non-empty spsc queue
(a job in the queue) and no *current* waiting dependencies [1].
Dependencies for an entity can be added when drm_sched_entity_pop_job()
is called [2][3], in which case it returns a NULL job. Thus we can get
into a scenario where two entities, A and B, both have jobs and no
current dependencies. A's job is waiting on B's job, entity A gets
selected first, a dependency gets installed in
drm_sched_entity_pop_job(), the run job worker goes idle, and now we
deadlock.

The proper solution is to loop over all ready entities until one with a
job is found via drm_sched_entity_pop_job() and then requeue the run
job worker. Or loop over all entities until drm_sched_select_entity()
returns NULL and then let the run job worker go idle. This is what the
old threaded design did too [4]. A standalone toy model of the A/B
scenario is sketched at the end of this mail. Hope this clears
everything up.

Matt

[1] https://elixir.bootlin.com/linux/latest/source/drivers/gpu/drm/scheduler/sched_entity.c#L144
[2] https://elixir.bootlin.com/linux/latest/source/drivers/gpu/drm/scheduler/sched_entity.c#L464
[3] https://elixir.bootlin.com/linux/latest/source/drivers/gpu/drm/scheduler/sched_entity.c#L397
[4] https://elixir.bootlin.com/linux/latest/source/drivers/gpu/drm/scheduler/sched_main.c#L1011

> If that's not the case any more then you have broken something else.
> 
> Regards,
> Christian.
> 
> > 
> > Cc: Thorsten Leemhuis <regressions@xxxxxxxxxxxxx>
> > Reported-by: Mikhail Gavrilov <mikhail.v.gavrilov@xxxxxxxxx>
> > Closes: https://lore.kernel.org/all/CABXGCsM2VLs489CH-vF-1539-s3in37=bwuOWtoeeE+q26zE+Q@xxxxxxxxxxxxxx/
> > Reported-and-tested-by: Mario Limonciello <mario.limonciello@xxxxxxx>
> > Closes: https://gitlab.freedesktop.org/drm/amd/-/issues/3124
> > Link: https://lore.kernel.org/all/20240123021155.2775-1-mario.limonciello@xxxxxxx/
> > Reported-by: Vlastimil Babka <vbabka@xxxxxxx>
> > Closes: https://lore.kernel.org/dri-devel/05ddb2da-b182-4791-8ef7-82179fd159a8@xxxxxxx/T/#m0c31d4d1b9ae9995bb880974c4f1dbaddc33a48a
> > Signed-off-by: Matthew Brost <matthew.brost@xxxxxxxxx>
> > ---
> >  drivers/gpu/drm/scheduler/sched_main.c | 15 +++++++--------
> >  1 file changed, 7 insertions(+), 8 deletions(-)
> > 
> > diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
> > index 550492a7a031..85f082396d42 100644
> > --- a/drivers/gpu/drm/scheduler/sched_main.c
> > +++ b/drivers/gpu/drm/scheduler/sched_main.c
> > @@ -1178,21 +1178,20 @@ static void drm_sched_run_job_work(struct work_struct *w)
> >  	struct drm_sched_entity *entity;
> >  	struct dma_fence *fence;
> >  	struct drm_sched_fence *s_fence;
> > -	struct drm_sched_job *sched_job;
> > +	struct drm_sched_job *sched_job = NULL;
> >  	int r;
> >  
> >  	if (READ_ONCE(sched->pause_submit))
> >  		return;
> >  
> > -	entity = drm_sched_select_entity(sched);
> > +	/* Find entity with a ready job */
> > +	while (!sched_job && (entity = drm_sched_select_entity(sched))) {
> > +		sched_job = drm_sched_entity_pop_job(entity);
> > +		if (!sched_job)
> > +			complete_all(&entity->entity_idle);
> > +	}
> >  	if (!entity)
> > -		return;
> > -
> > -	sched_job = drm_sched_entity_pop_job(entity);
> > -	if (!sched_job) {
> > -		complete_all(&entity->entity_idle);
> >  		return;	/* No more work */
> > -	}
> >  
> >  	s_fence = sched_job->s_fence;
> 
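
To make the A/B scenario above concrete, here is a standalone toy model
(userspace C, not kernel code). The struct and the select/pop helpers are
simplified, made-up stand-ins for drm_sched_entity,
drm_sched_select_entity() and drm_sched_entity_pop_job(); only the control
flow is meant to match. Selecting a single entity and stopping would end at
A (pop installs A's dependency and fails) with B's job still queued; the
drain loop finds B on the next iteration.

#include <stdbool.h>
#include <stdio.h>

struct entity {
	const char *name;
	bool has_job;		/* non-empty job queue */
	bool dep_installed;	/* waiting dependency already known */
	struct entity *dep;	/* this entity's job depends on dep's job */
};

/* Simplified select: "ready" == queued job and no known dependency */
static struct entity *select_entity(struct entity **ents, int n)
{
	for (int i = 0; i < n; i++)
		if (ents[i]->has_job && !ents[i]->dep_installed)
			return ents[i];
	return NULL;
}

/* Simplified pop: may discover a dependency and fail (NULL job in real code) */
static bool pop_job(struct entity *e)
{
	if (e->dep && e->dep->has_job) {
		e->dep_installed = true;	/* dependency installed here */
		return false;
	}
	e->has_job = false;
	printf("ran job on entity %s\n", e->name);
	return true;
}

int main(void)
{
	struct entity B = { "B", true, false, NULL };
	struct entity A = { "A", true, false, &B };	/* A waits on B */
	struct entity *ents[] = { &A, &B };
	struct entity *e;
	bool found = false;

	/*
	 * Drain loop as in the patch above: keep selecting ready entities
	 * until one with a runnable job is found, instead of giving up
	 * after the first selection.
	 */
	while (!found && (e = select_entity(ents, 2)))
		found = pop_job(e);

	return found ? 0 : 1;
}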