On Fri, 2024-09-20 at 12:33 +0200, Christian König wrote: > Am 20.09.24 um 10:57 schrieb Philipp Stanner: > > On Wed, 2024-09-18 at 15:39 +0200, Christian König wrote: > > > Tearing down the scheduler with jobs still on the pending list > > > can > > > lead to use after free issues. Add a warning if drivers try to > > > destroy a scheduler which still has work pushed to the HW. > > Did you have time yet to look into my proposed waitque-solution? > > I don't remember seeing anything. What have I missed? https://lore.kernel.org/all/20240903094446.29797-2-pstanner@xxxxxxxxxx/ > > > > > > When there are still entities with jobs the situation is even > > > worse > > > since the dma_fences for those jobs can never signal we can just > > > choose between potentially locking up core memory management and > > > random memory corruption. When drivers really mess it up that > > > well > > > let them run into a BUG_ON(). > > > > > > Signed-off-by: Christian König <christian.koenig@xxxxxxx> > > > --- > > > drivers/gpu/drm/scheduler/sched_main.c | 19 ++++++++++++++++++- > > > 1 file changed, 18 insertions(+), 1 deletion(-) > > > > > > diff --git a/drivers/gpu/drm/scheduler/sched_main.c > > > b/drivers/gpu/drm/scheduler/sched_main.c > > > index f093616fe53c..8a46fab5cdc8 100644 > > > --- a/drivers/gpu/drm/scheduler/sched_main.c > > > +++ b/drivers/gpu/drm/scheduler/sched_main.c > > > @@ -1333,17 +1333,34 @@ void drm_sched_fini(struct > > > drm_gpu_scheduler > > > *sched) > > I agree with Sima that it should first be documented in the > > function's > > docstring what the user is expected to have done before calling the > > function. > > Good point, going to update the documentation as well. Cool thing, thx. Would be great if everything (not totally trivial) necessary to be done before _fini() is mentioned. One could also think about providing a hint at how the driver can do that. AFAICS the only way for the driver to ensure that is to maintain its own, separate list of submitted jobs. P. > > Thanks, > Christian. > > > > > P. > > > > > > > > drm_sched_wqueue_stop(sched); > > > > > > + /* > > > + * Tearing down the scheduler wile there are still > > > unprocessed jobs can > > > + * lead to use after free issues in the scheduler fence. > > > + */ > > > + WARN_ON(!list_empty(&sched->pending_list)); > > > + > > > for (i = DRM_SCHED_PRIORITY_KERNEL; i < sched->num_rqs; > > > i++) > > > { > > > struct drm_sched_rq *rq = sched->sched_rq[i]; > > > > > > spin_lock(&rq->lock); > > > - list_for_each_entry(s_entity, &rq->entities, > > > list) > > > + list_for_each_entry(s_entity, &rq->entities, > > > list) { > > > + /* > > > + * The justification for this BUG_ON() > > > is > > > that tearing > > > + * down the scheduler while jobs are > > > pending > > > leaves > > > + * dma_fences unsignaled. Since we have > > > dependencies > > > + * from the core memory management to > > > eventually signal > > > + * dma_fences this can trivially lead to > > > a > > > system wide > > > + * stop because of a locked up memory > > > management. > > > + */ > > > + BUG_ON(spsc_queue_count(&s_entity- > > > > job_queue)); > > > + > > > /* > > > * Prevents reinsertion and marks > > > job_queue > > > as idle, > > > * it will removed from rq in > > > drm_sched_entity_fini > > > * eventually > > > */ > > > s_entity->stopped = true; > > > + } > > > spin_unlock(&rq->lock); > > > kfree(sched->sched_rq[i]); > > > } >