On 1/29/24 08:44, Christian König wrote:
> On 26.01.24 at 17:29, Matthew Brost wrote:
>> On Fri, Jan 26, 2024 at 11:32:57AM +0100, Christian König wrote:
>>> On 25.01.24 at 18:30, Matthew Brost wrote:
>>>> On Thu, Jan 25, 2024 at 04:12:58PM +0100, Christian König wrote:
>>>>> On 24.01.24 at 22:08, Matthew Brost wrote:
>>>>>> All entities must be drained in the DRM scheduler run job worker to
>>>>>> avoid the following case. An entity is found that is ready, no job is
>>>>>> found ready on that entity, and the run job worker goes idle with other
>>>>>> entities + jobs ready. Draining all ready entities (i.e. looping over
>>>>>> all ready entities) in the run job worker ensures all jobs that are
>>>>>> ready will be scheduled.
>>>>> That doesn't make sense. drm_sched_select_entity() only returns entities
>>>>> which are "ready", i.e. have a job to run.
>>>>>
>>>> That is what I thought too, hence my original design, but it is not
>>>> exactly true. Let me explain.
>>>>
>>>> drm_sched_select_entity() returns an entity with a non-empty spsc queue
>>>> (job in queue) and no *current* waiting dependencies [1]. Dependencies for
>>>> an entity can be added when drm_sched_entity_pop_job() is called [2][3],
>>>> returning a NULL job. Thus we can get into a scenario where 2 entities
>>>> A and B both have jobs and no current dependencies. A's job is waiting on
>>>> B's job, entity A gets selected first, a dependency gets installed in
>>>> drm_sched_entity_pop_job(), run work goes idle, and now we deadlock.
>>> And here is the real problem. Run work doesn't go idle in that moment.
>>>
>>> drm_sched_run_job_work() should restart itself until there is either no
>>> more space in the ring buffer or it can't find a ready entity any more.
>>>
>>> At least that was the original design when that was all still driven by a
>>> kthread.
>>>
>>> It may well be that we messed this up when switching from kthread to a
>>> work item.
>>>
>> Right, that is what this patch does - the run worker does not go idle until
>> no ready entities are found. That was incorrect in the original patch
>> and is fixed here. Do you have any issues with this fix? It has been tested
>> 3 times and clearly fixes the issue.
>
> Ah! Yes, in this case that patch here is a little bit ugly as well.
>
> The original idea was that run_job restarts so that we are able to pause
> the submission thread without searching for an entity to submit more.
>
> I strongly suggest replacing the while loop with a call to
> drm_sched_run_job_queue() so that when the entity can't provide a job we
> just restart the queuing work.

Note it's already included in rc2, so any changes need to be a followup fix.
If these are important, then please make sure they get to rc3 :)
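
For readers following the A/B scenario described in the quoted text, here is a
minimal user-space sketch of it. It is a toy model under stated assumptions,
not the scheduler code: every toy_* type and function is hypothetical, each
entity holds a single job, and "re-queuing the work item" is modelled by the
pass function returning true so the caller runs another pass. It only
illustrates why going idle after a NULL pop strands entity B, and why
restarting the work in that case lets both jobs run.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy model only: these types and helpers are hypothetical stand-ins,
 * not the real drm_sched structures or functions. */
struct toy_job {
    struct toy_job *dep;   /* job that must finish first, or NULL */
    bool done;
};

struct toy_entity {
    struct toy_job *queued;      /* single queued job, or NULL if empty */
    struct toy_job *waiting_on;  /* dependency installed lazily by toy_pop_job() */
};

/* Ready = non-empty queue and no *currently installed* unfinished dependency. */
static bool toy_entity_ready(const struct toy_entity *e)
{
    return e->queued && !(e->waiting_on && !e->waiting_on->done);
}

/* Return the first ready entity in array order, or NULL if none is ready. */
static struct toy_entity *toy_select_entity(struct toy_entity *ents, size_t n)
{
    for (size_t i = 0; i < n; i++)
        if (toy_entity_ready(&ents[i]))
            return &ents[i];
    return NULL;
}

/* May return NULL for a "ready" entity: the dependency is only found here. */
static struct toy_job *toy_pop_job(struct toy_entity *e)
{
    struct toy_job *job = e->queued;

    if (job->dep && !job->dep->done) {
        e->waiting_on = job->dep;   /* install the dependency, no job yet */
        return NULL;
    }
    e->queued = NULL;
    e->waiting_on = NULL;
    return job;
}

/* One worker pass. Returns true if the worker re-queues itself. */
static bool toy_worker_pass(struct toy_entity *ents, size_t n,
                            bool requeue_on_null, int *completed)
{
    struct toy_entity *e = toy_select_entity(ents, n);
    struct toy_job *job;

    if (!e)
        return false;               /* nothing ready: go idle */

    job = toy_pop_job(e);
    if (!job)
        return requeue_on_null;     /* the behaviour under discussion */

    assert(!job->dep || job->dep->done);
    job->done = true;               /* "run" the job */
    (*completed)++;
    return true;                    /* more work may be ready */
}

/* Entities A and B; A's job depends on B's job and A is selected first.
 * Returns the number of jobs that completed before the worker went idle. */
static int toy_simulate(bool requeue_on_null)
{
    struct toy_job job_b = { .dep = NULL, .done = false };
    struct toy_job job_a = { .dep = &job_b, .done = false };
    struct toy_entity ents[2] = {
        { .queued = &job_a, .waiting_on = NULL },   /* entity A */
        { .queued = &job_b, .waiting_on = NULL },   /* entity B */
    };
    int completed = 0;

    while (toy_worker_pass(ents, 2, requeue_on_null, &completed))
        ;
    return completed;
}
```

In this model toy_simulate(false) returns 0: A is selected, the pop installs
the dependency and returns NULL, the worker goes idle, and B's job never runs.
toy_simulate(true) returns 2: the next pass skips A, runs B's job, and a later
pass runs A's job. The model has no ring buffer limit and no pause mechanism,
so it says nothing about whether a loop or a re-queue is the nicer way to get
there.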