Re: [PATCH 1/2] drm/sched: add WARN_ON and BUG_ON to drm_sched_fini

Danilo Krummrich <dakr@xxxxxxxxxx> · Mon, 14 Oct 2024 16:49:50 +0200

On Tue, Sep 24, 2024 at 01:18:25PM +0200, Simona Vetter wrote:
> On Mon, Sep 23, 2024 at 05:24:10PM +0200, Christian König wrote:
> > Am 20.09.24 um 15:26 schrieb Philipp Stanner:
> > > On Fri, 2024-09-20 at 12:33 +0200, Christian König wrote:
> > > > Am 20.09.24 um 10:57 schrieb Philipp Stanner:
> > > > > On Wed, 2024-09-18 at 15:39 +0200, Christian König wrote:
> > > > > > Tearing down the scheduler with jobs still on the pending list
> > > > > > can
> > > > > > lead to use after free issues. Add a warning if drivers try to
> > > > > > destroy a scheduler which still has work pushed to the HW.
> > > > > Did you have time yet to look into my proposed waitque-solution?
> > > > I don't remember seeing anything. What have I missed?
> > > https://lore.kernel.org/all/20240903094446.29797-2-pstanner@xxxxxxxxxx/
> > 
> > Mhm, I didn't got that in my inbox for some reason.
> > 
> > Interesting approach, I'm just not sure if we can or should wait in
> > drm_sched_fini().
> > 
> > Probably better to make that a separate function, something like
> > drm_sched_flush() or similar.
> 
> Yeah I don't think we should smash this into drm_sched_fini
> unconditionally. I think conceptually there's about three cases:
> 
> - Ringbuffer schedules. Probably want everything as-is, because
>   drm_sched_fini is called long after all the entities are gone in
>   drm_device cleanup.

Even if entities are long gone we can still have jobs on the pending list of the
scheduler. Calling drm_sched_fini() would still leak those jobs.

It's just that for this case it's unlikely, but I don't think we want to rely on
"unlikely".

Do I miss anything?

> 
> - fw scheduler hardware with preemption support. There we probably want to
>   nuke the context by setting the tdr timeout to zero (or maybe just as
>   long as context preemption takes to be efficient), and relying on the
>   normal gpu reset flow to handle things. drm_sched_entity_flush kinda
>   does this, except not really and it's a lot more focused on the
>   ringbuffer context. So maybe we want a new drm_sched_entity_kill.
> 
>   For this case calling drm_sched_fini() after the 1:1 entity is gone
>   should not find any linger jobs, it would actually be a bug somewhere if
>   there's a job lingering. Maybe a sanity check that there's not just no
>   jobs lingering, but also no entity left would be good here?

This is a timing assumption too, isn't it? What if we reach drm_sched_fini()
quicker than all the free() callbacks from the jobs on the pending list are
processed?

I know Xe uses the "tdr to zero trick", so I'm not aware of all implications --
in Nouveau I had to do it differently.

> 
> - fw scheduler without preemption support. There we kinda need the
>   drm_sched_flush, except blocking in fops->close is not great. So instead
>   I think the following is better:
>   1. drm_sched_entity_stopped, which only stops new submissions (for
>   paranoia) but doesn't tear down the entity
>   2. drm_dev_get
>   3. launch a worker which does a) drm_sched_flush (or
>   drm_sched_entity_flush or whatever we call it) b) drm_sched_entity_fini

I think it would be drm_sched_flush(); the jobs on the pending are not
associated with any entity any more. See also 96c7c2f4d5bd ("drm/scheduler: set
entity to NULL in drm_sched_entity_pop_job()").

>   + drm_sched_fini c) drm_dev_put
> 
>   Note that semantically this implements the refcount in the other path
>   from Phillip:
> 
>   https://lore.kernel.org/all/20240903094531.29893-2-pstanner@xxxxxxxxxx/
>   
>   Except it doesn't impose refcount on everyone else who doesn't need it,
>   and it doesn't even impose refcounting on drivers that do need it
>   because we use drm_sched_flush and a worker to achieve the same.
> 
> Essentially helper functions for the common use-cases instead of trying to
> solve them all by putting drm_sched_flush as a potentially very blocking
> function into drm_sched_fini.
> 
> 
> > > > > > When there are still entities with jobs the situation is even
> > > > > > worse
> > > > > > since the dma_fences for those jobs can never signal we can just
> > > > > > choose between potentially locking up core memory management and
> > > > > > random memory corruption. When drivers really mess it up that
> > > > > > well
> > > > > > let them run into a BUG_ON().
> > > > > > 
> > > > > > Signed-off-by: Christian König<christian.koenig@xxxxxxx>
> > > > > > ---
> > > > > >    drivers/gpu/drm/scheduler/sched_main.c | 19 ++++++++++++++++++-
> > > > > >    1 file changed, 18 insertions(+), 1 deletion(-)
> > > > > > 
> > > > > > diff --git a/drivers/gpu/drm/scheduler/sched_main.c
> > > > > > b/drivers/gpu/drm/scheduler/sched_main.c
> > > > > > index f093616fe53c..8a46fab5cdc8 100644
> > > > > > --- a/drivers/gpu/drm/scheduler/sched_main.c
> > > > > > +++ b/drivers/gpu/drm/scheduler/sched_main.c
> > > > > > @@ -1333,17 +1333,34 @@ void drm_sched_fini(struct
> > > > > > drm_gpu_scheduler
> > > > > > *sched)
> > > > > I agree with Sima that it should first be documented in the
> > > > > function's
> > > > > docstring what the user is expected to have done before calling the
> > > > > function.
> > > > Good point, going to update the documentation as well.
> > > Cool thing, thx.
> > > Would be great if everything (not totally trivial) necessary to be done
> > > before _fini() is mentioned.
> > > 
> > > One could also think about providing a hint at how the driver can do
> > > that. AFAICS the only way for the driver to ensure that is to maintain
> > > its own, separate list of submitted jobs.
> > 
> > Even with a duplicated pending list it's actually currently impossible to do
> > this fully cleanly.
> > 
> > The problem is that the dma_fence object gives no guarantee when callbacks
> > are processed, e.g. they can be both processed from interrupt context as
> > well as from a CPU which called dma_fence_is_signaled().
> > 
> > So when a driver (or drm_sched_fini) waits for the last submitted fence it
> > actually can be that the drm_sched object still needs to do some processing.
> > See the hack in amdgpu_vm_tlb_seq() for more background on the problem.
> 
> So I thought this should be fairly easy because of the sched hw/public
> fence split: If we wait for both all jobs to finish and for all the sched
> work/tdr work to finish, and we make sure there's no entity existing
> that's not yet stopped we should catch them all? Or at least I think it's
> a bug if any other code even tries to touch the hw fence.
> 
> If you have any other driver code which relies on the rcu freeing then I
> think that's just a separate concern and we can ignore that here since the
> fences themselves will till get rcu-delay freed even if drm_sched_fini has
> finished.
> -Sima
> 
> > 
> > Regards,
> > Christian.
> > 
> > > 
> > > P.
> > > 
> > > > Thanks,
> > > > Christian.
> > > > 
> > > > > P.
> > > > > 
> > > > > >    	drm_sched_wqueue_stop(sched);
> > > > > > +	/*
> > > > > > +	 * Tearing down the scheduler wile there are still
> > > > > > unprocessed jobs can
> > > > > > +	 * lead to use after free issues in the scheduler fence.
> > > > > > +	 */
> > > > > > +	WARN_ON(!list_empty(&sched->pending_list));
> > > > > > +
> > > > > >    	for (i = DRM_SCHED_PRIORITY_KERNEL; i < sched->num_rqs;
> > > > > > i++)
> > > > > > {
> > > > > >    		struct drm_sched_rq *rq = sched->sched_rq[i];
> > > > > >    		spin_lock(&rq->lock);
> > > > > > -		list_for_each_entry(s_entity, &rq->entities,
> > > > > > list)
> > > > > > +		list_for_each_entry(s_entity, &rq->entities,
> > > > > > list) {
> > > > > > +			/*
> > > > > > +			 * The justification for this BUG_ON()
> > > > > > is
> > > > > > that tearing
> > > > > > +			 * down the scheduler while jobs are
> > > > > > pending
> > > > > > leaves
> > > > > > +			 * dma_fences unsignaled. Since we have
> > > > > > dependencies
> > > > > > +			 * from the core memory management to
> > > > > > eventually signal
> > > > > > +			 * dma_fences this can trivially lead to
> > > > > > a
> > > > > > system wide
> > > > > > +			 * stop because of a locked up memory
> > > > > > management.
> > > > > > +			 */
> > > > > > +			BUG_ON(spsc_queue_count(&s_entity-
> > > > > > > job_queue));
> > > > > > +
> > > > > >    			/*
> > > > > >    			 * Prevents reinsertion and marks
> > > > > > job_queue
> > > > > > as idle,
> > > > > >    			 * it will removed from rq in
> > > > > > drm_sched_entity_fini
> > > > > >    			 * eventually
> > > > > >    			 */
> > > > > >    			s_entity->stopped = true;
> > > > > > +		}
> > > > > >    		spin_unlock(&rq->lock);
> > > > > >    		kfree(sched->sched_rq[i]);
> > > > > >    	}
> 
> -- 
> Simona Vetter
> Software Engineer, Intel Corporation
> http://blog.ffwll.ch
>