Re: [PATCH 1/2] drm/sched: add WARN_ON and BUG_ON to drm_sched_fini

Philipp Stanner <pstanner@xxxxxxxxxx> · Fri, 22 Nov 2024 17:24:03 +0100

On Thu, 2024-11-21 at 10:05 -0800, Matthew Brost wrote:
> On Tue, Oct 29, 2024 at 08:22:22AM +0100, Philipp Stanner wrote:
> > Christian, Sima?
> > 
> > Matthew? (+CC)
> > 
> 
> Sorry catching up here. Internally in Xe we ref count the scheduler
> as
> the scheduler is embedded into our user queue structure
> (xe_exec_queue).
> Jobs ref the queue once they are armed. Upon queue killing we set TDR
> to
> zero which will flush out all jobs / signal all the job's fences.
> Once
> the ref count of queue is zero we do hardware / firmware teardown of
> the
> queue and then finally call drm_sched_fini before freeing the queues
> memory (and thus the scheduler).
> 
> This flow seems to work pretty well but largely bypasses the
> scheduler
> functions to implement this. Not sure if this flow could be
> normalized
> at all but I would expect usage models of a 1 to 1 relationship
> between
> queue and scheduler to something similar to Xe's flow.

One of the solution I proposed basically does the same: jobs added to
gpu_scheduler.pending_list refcount the scheduler.

If we'd go for that solution I assume the separate refcounting could be
removed from Xe. For exmaple.

>  Maybe we could
> write this done as design guideline so other drivers don't have to
> figure this out - this took me a bit to land on this.

I already improved the documentation a while ago [1], but didn't detail
how drivers are supposed to solve this. The reason is that I don't want
to encourage more separate solutions while we're working on solving it
centrally.

> With that, in general I agree with Christian's patch here complaining
> about pending jobs if drm_sched_fini is called.

Yes, firing a WARN_ON there is also fine from my POV.

P.

[1] https://lore.kernel.org/dri-devel/20241105143137.71893-2-pstanner@xxxxxxxxxx/

> 
> > Opinions on the below?
> > 
> > tl;dr:
> > I still think it's a good thing to detectably block in
> > drm_sched_fini(), or at the very least drm_sched_flush(), because
> > then
> 
> See above. I don't think drm_sched_fini should block rather just
> complain this patch does which says 'go fix your driver'.
> 
> Matt
> 
> > you'll find out that the driver is broken and can repair it.
> > 
> > P.
> > 
> > 
> > On Fri, 2024-10-18 at 14:07 +0200, Philipp Stanner wrote:
> > > On Mon, 2024-10-14 at 16:56 +0200, Danilo Krummrich wrote:
> > > > On Fri, Sep 27, 2024 at 11:04:48AM +0200, Christian König
> > > > wrote:
> > > > > Am 25.09.24 um 16:53 schrieb Philipp Stanner:
> > > > > > On Tue, 2024-09-24 at 13:18 +0200, Simona Vetter wrote:
> > > > > > > On Mon, Sep 23, 2024 at 05:24:10PM +0200, Christian König
> > > > > > > wrote:
> > > > > > > > Am 20.09.24 um 15:26 schrieb Philipp Stanner:
> > > > > > > > > On Fri, 2024-09-20 at 12:33 +0200, Christian König
> > > > > > > > > wrote:
> > > > > > > > > > Am 20.09.24 um 10:57 schrieb Philipp Stanner:
> > > > > > > > > > > On Wed, 2024-09-18 at 15:39 +0200, Christian
> > > > > > > > > > > König
> > > > > > > > > > > wrote:
> > > > > > > > > > > > Tearing down the scheduler with jobs still on
> > > > > > > > > > > > the
> > > > > > > > > > > > pending
> > > > > > > > > > > > list
> > > > > > > > > > > > can
> > > > > > > > > > > > lead to use after free issues. Add a warning if
> > > > > > > > > > > > drivers try
> > > > > > > > > > > > to
> > > > > > > > > > > > destroy a scheduler which still has work pushed
> > > > > > > > > > > > to
> > > > > > > > > > > > the HW.
> > > > > > > > > > > Did you have time yet to look into my proposed
> > > > > > > > > > > waitque-
> > > > > > > > > > > solution?
> > > > > > > > > > I don't remember seeing anything. What have I
> > > > > > > > > > missed?
> > > > > > > > > https://lore.kernel.org/all/20240903094446.29797-2-pstanner@xxxxxxxxxx/
> > > > > > > > Mhm, I didn't got that in my inbox for some reason.
> > > > > > > > 
> > > > > > > > Interesting approach, I'm just not sure if we can or
> > > > > > > > should
> > > > > > > > wait in
> > > > > > > > drm_sched_fini().
> > > > > > We do agree that jobs still pending when drm_sched_fini()
> > > > > > starts
> > > > > > is
> > > > > > always a bug, right?
> > > > > 
> > > > > Correct, the question is how to avoid that.
> > > > > 
> > > > > > If so, what are the disadvantages of waiting in
> > > > > > drm_sched_fini()?
> > > > > > We
> > > > > > could block buggy drivers as I see it. Which wouldn't be
> > > > > > good,
> > > > > > but
> > > > > > could then be fixed on drivers' site.
> > > > > 
> > > > > Sima explained that pretty well: Don't block in fops->close,
> > > > > do
> > > > > that in
> > > > > fops->flush instead.
> > > > 
> > > > I agree that we shouldn't block in close(), but this
> > > > effectively
> > > > means that we
> > > > need to reference count the scheduler, right?
> > > > 
> > > > Otherwise, if we allow to just skip / interrupt the teardown,
> > > > we
> > > > can
> > > > still leak
> > > > memory.
> > > 
> > > Having thought about it, I agree with Danilo. Having something
> > > that
> > > shall wait on green light, but can be interrupted, is no
> > > guarantee
> > > and
> > > therefore not a feasible solution.
> > > 
> > > To break down the solution space, these seem to be our options:
> > >    1. We have something (either drm_sched_fini() or a helper,
> > > e.g.,
> > >       drm_sched_flush()) that definitely blocks until the pending
> > > list
> > >       has become empty.
> > >    2. We have jobs reference-count the scheduler, so the latter
> > > can
> > >       outlive the driver and will be freed some time later.
> > > 
> > > Can anyone think of a third solution?
> > > 
> > > 
> > > Solution #1 has the problem of obviously blocking unconditionally
> > > if
> > > the driver didn't make sure that all fences will be signaled
> > > within
> > > reasonable time. In my opinion, this would actually be an
> > > advantage,
> > > because it will be *very* noticable and force users to repair
> > > their
> > > driver. The driver *has* to guarantee that all fences will be
> > > signaled.
> > > If the driver has to do fishy things, having the blocking
> > > outsourced
> > > to
> > > the helper drm_sched_flush() would allow them to circumvent that.
> > > 
> > > Solution #2 has the problem of backend_ops.free_job() potentially
> > > using
> > > driver-data after the driver is gone, causing UAF. So with this
> > > solutions all drivers would have to be aware of the issue and
> > > handle
> > > it
> > > through one of DRMs primitives dedicated to such problems.
> > > 
> > > 
> > > Currently, all drivers either work around the problem internally
> > > or
> > > simply ignore it, it seems.
> > > 
> > > So I'd argue that both solutions are an improvement over the
> > > existing
> > > situation. My preference would be #1.
> > > 
> > > 
> > > Opinions?
> > > 
> > > P.
> > > 
> > > > 
> > > > > 
> > > > > One issue this solves is that when you send a SIGTERM the
> > > > > tear
> > > > > down
> > > > > handling
> > > > > first flushes all the FDs and then closes them.
> > > > > 
> > > > > So if flushing the FDs blocks because the process initiated
> > > > > sending
> > > > > a
> > > > > terabyte of data over a 300bps line (for example) you can
> > > > > still
> > > > > throw a
> > > > > SIGKILL and abort that as well.
> > > > > 
> > > > > If you would block in fops-close() that SIGKILL won't have
> > > > > any
> > > > > effect any
> > > > > more because by the time close() is called the process is
> > > > > gone
> > > > > and
> > > > > signals
> > > > > are already blocked.
> > > > > 
> > > > > And yes when I learned about that issue I was also buffed
> > > > > that
> > > > > handling like
> > > > > this in the UNIX design is nearly 50 years old and still
> > > > > applies
> > > > > to
> > > > > today.
> > > > > > > > Probably better to make that a separate function,
> > > > > > > > something
> > > > > > > > like
> > > > > > > > drm_sched_flush() or similar.
> > > > > > We could do that. Such a function could then be called by
> > > > > > drivers
> > > > > > which
> > > > > > are not sure whether all jobs are done before they start
> > > > > > tearing
> > > > > > down.
> > > > > 
> > > > > Yes exactly that's the idea. And give that flush function a
> > > > > return
> > > > > code so
> > > > > that it can return -EINTR.
> > > > > 
> > > > > > > Yeah I don't think we should smash this into
> > > > > > > drm_sched_fini
> > > > > > > unconditionally. I think conceptually there's about three
> > > > > > > cases:
> > > > > > > 
> > > > > > > - Ringbuffer schedules. Probably want everything as-is,
> > > > > > > because
> > > > > > >    drm_sched_fini is called long after all the entities
> > > > > > > are
> > > > > > > gone in
> > > > > > >    drm_device cleanup.
> > > > > > > 
> > > > > > > - fw scheduler hardware with preemption support. There we
> > > > > > > probably
> > > > > > > want to
> > > > > > >    nuke the context by setting the tdr timeout to zero
> > > > > > > (or
> > > > > > > maybe just
> > > > > > > as
> > > > > > >    long as context preemption takes to be efficient), and
> > > > > > > relying on
> > > > > > > the
> > > > > > >    normal gpu reset flow to handle things.
> > > > > > > drm_sched_entity_flush
> > > > > > > kinda
> > > > > > >    does this, except not really and it's a lot more
> > > > > > > focused
> > > > > > > on
> > > > > > > the
> > > > > > >    ringbuffer context. So maybe we want a new
> > > > > > > drm_sched_entity_kill.
> > > > > > > 
> > > > > > >    For this case calling drm_sched_fini() after the 1:1
> > > > > > > entity
> > > > > > > is gone
> > > > > > >    should not find any linger jobs, it would actually be
> > > > > > > a
> > > > > > > bug
> > > > > > > somewhere if
> > > > > > >    there's a job lingering. Maybe a sanity check that
> > > > > > > there's
> > > > > > > not just
> > > > > > > no
> > > > > > >    jobs lingering, but also no entity left would be good
> > > > > > > here?
> > > > > > The check for lingering ones is in Christian's patch here
> > > > > > IISC.
> > > > > > At which position would you imagine the check for the
> > > > > > entity
> > > > > > being
> > > > > > performed?
> > > > > > 
> > > > > > > - fw scheduler without preemption support. There we kinda
> > > > > > > need
> > > > > > > the
> > > > > > >    drm_sched_flush, except blocking in fops->close is not
> > > > > > > great. So
> > > > > > > instead
> > > > > > >    I think the following is better:
> > > > > > >    1. drm_sched_entity_stopped, which only stops new
> > > > > > > submissions (for
> > > > > > >    paranoia) but doesn't tear down the entity
> > > > > > Who would call that function?
> > > > > > Drivers using it voluntarily could just as well stop
> > > > > > accepting
> > > > > > new jobs
> > > > > > from userspace to their entities, couldn't they?
> > > > > > 
> > > > > > >    2. drm_dev_get
> > > > > > >    3. launch a worker which does a) drm_sched_flush (or
> > > > > > >    drm_sched_entity_flush or whatever we call it) b)
> > > > > > > drm_sched_entity_fini
> > > > > > >    + drm_sched_fini c) drm_dev_put
> > > > > > > 
> > > > > > >    Note that semantically this implements the refcount in
> > > > > > > the
> > > > > > > other
> > > > > > > path
> > > > > > >    from Phillip:
> > > > > > > 
> > > > > > > https://lore.kernel.org/all/20240903094531.29893-2-pstanner@xxxxxxxxxx/
> > > > > > >    Except it doesn't impose refcount on everyone else who
> > > > > > > doesn't need
> > > > > > > it,
> > > > > > >    and it doesn't even impose refcounting on drivers that
> > > > > > > do
> > > > > > > need it
> > > > > > >    because we use drm_sched_flush and a worker to achieve
> > > > > > > the
> > > > > > > same.
> > > > > > I indeed wasn't happy with the refcount approach for that
> > > > > > reason,
> > > > > > agreed.
> > > > > > 
> > > > > > > Essentially helper functions for the common use-cases
> > > > > > > instead
> > > > > > > of
> > > > > > > trying to
> > > > > > > solve them all by putting drm_sched_flush as a
> > > > > > > potentially
> > > > > > > very
> > > > > > > blocking
> > > > > > > function into drm_sched_fini.
> > > > > > I'm still not able to see why it blocking would be
> > > > > > undesired –
> > > > > > as
> > > > > > far
> > > > > > as I can see, it is only invoked on driver teardown, so not
> > > > > > during
> > > > > > active operation. Teardown doesn't happen that often, and
> > > > > > it
> > > > > > can
> > > > > > (if
> > > > > > implemented correctly) only block until the driver's code
> > > > > > has
> > > > > > signaled
> > > > > > the last fences. If that doesn't happen, the block would
> > > > > > reveal
> > > > > > a
> > > > > > bug.
> > > > > > 
> > > > > > But don't get me wrong: I don't want to *push* this
> > > > > > solution. I
> > > > > > just
> > > > > > want to understand when it could become a problem.
> > > > > > 
> > > > > > 
> > > > > > Wouldn't an explicitly blocking, separate function like
> > > > > > drm_sched_flush() or drm_sched_fini_flush() be a small,
> > > > > > doable
> > > > > > step
> > > > > > towards the right direction?
> > > > > 
> > > > > I think that this is the right thing to do, yes.
> > > > > 
> > > > > > > > > > > > When there are still entities with jobs the
> > > > > > > > > > > > situation
> > > > > > > > > > > > is
> > > > > > > > > > > > even
> > > > > > > > > > > > worse
> > > > > > > > > > > > since the dma_fences for those jobs can never
> > > > > > > > > > > > signal
> > > > > > > > > > > > we can
> > > > > > > > > > > > just
> > > > > > > > > > > > choose between potentially locking up core
> > > > > > > > > > > > memory
> > > > > > > > > > > > management and
> > > > > > > > > > > > random memory corruption. When drivers really
> > > > > > > > > > > > mess
> > > > > > > > > > > > it
> > > > > > > > > > > > up
> > > > > > > > > > > > that
> > > > > > > > > > > > well
> > > > > > > > > > > > let them run into a BUG_ON().
> > > > > > > > > > > > 
> > > > > > > > > > > > Signed-off-by: Christian
> > > > > > > > > > > > König<christian.koenig@xxxxxxx>
> > > > > > > > > > > > ---
> > > > > > > > > > > >     drivers/gpu/drm/scheduler/sched_main.c | 19
> > > > > > > > > > > > ++++++++++++++++++-
> > > > > > > > > > > >     1 file changed, 18 insertions(+), 1
> > > > > > > > > > > > deletion(-)
> > > > > > > > > > > > 
> > > > > > > > > > > > diff --git
> > > > > > > > > > > > a/drivers/gpu/drm/scheduler/sched_main.c
> > > > > > > > > > > > b/drivers/gpu/drm/scheduler/sched_main.c
> > > > > > > > > > > > index f093616fe53c..8a46fab5cdc8 100644
> > > > > > > > > > > > --- a/drivers/gpu/drm/scheduler/sched_main.c
> > > > > > > > > > > > +++ b/drivers/gpu/drm/scheduler/sched_main.c
> > > > > > > > > > > > @@ -1333,17 +1333,34 @@ void
> > > > > > > > > > > > drm_sched_fini(struct
> > > > > > > > > > > > drm_gpu_scheduler
> > > > > > > > > > > > *sched)
> > > > > > > > > > > I agree with Sima that it should first be
> > > > > > > > > > > documented
> > > > > > > > > > > in
> > > > > > > > > > > the
> > > > > > > > > > > function's
> > > > > > > > > > > docstring what the user is expected to have done
> > > > > > > > > > > before
> > > > > > > > > > > calling the
> > > > > > > > > > > function.
> > > > > > > > > > Good point, going to update the documentation as
> > > > > > > > > > well.
> > > > > > > > > Cool thing, thx.
> > > > > > > > > Would be great if everything (not totally trivial)
> > > > > > > > > necessary to
> > > > > > > > > be done
> > > > > > > > > before _fini() is mentioned.
> > > > > > > > > 
> > > > > > > > > One could also think about providing a hint at how
> > > > > > > > > the
> > > > > > > > > driver can
> > > > > > > > > do
> > > > > > > > > that. AFAICS the only way for the driver to ensure
> > > > > > > > > that
> > > > > > > > > is
> > > > > > > > > to
> > > > > > > > > maintain
> > > > > > > > > its own, separate list of submitted jobs.
> > > > > > > > Even with a duplicated pending list it's actually
> > > > > > > > currently
> > > > > > > > impossible to do
> > > > > > > > this fully cleanly.
> > > > > > > > 
> > > > > > > > The problem is that the dma_fence object gives no
> > > > > > > > guarantee
> > > > > > > > when
> > > > > > > > callbacks
> > > > > > > > are processed, e.g. they can be both processed from
> > > > > > > > interrupt
> > > > > > > > context as
> > > > > > > > well as from a CPU which called
> > > > > > > > dma_fence_is_signaled().
> > > > > > > > 
> > > > > > > > So when a driver (or drm_sched_fini) waits for the last
> > > > > > > > submitted
> > > > > > > > fence it
> > > > > > > > actually can be that the drm_sched object still needs
> > > > > > > > to do
> > > > > > > > some
> > > > > > > > processing.
> > > > > > > > See the hack in amdgpu_vm_tlb_seq() for more background
> > > > > > > > on
> > > > > > > > the
> > > > > > > > problem.
> > > > > > Oh dear ^^'
> > > > > > We better work towards fixing that centrally
> > > > > > 
> > > > > > Thanks,
> > > > > > P.
> > > > > > 
> > > > > > 
> > > > > > > So I thought this should be fairly easy because of the
> > > > > > > sched
> > > > > > > hw/public
> > > > > > > fence split: If we wait for both all jobs to finish and
> > > > > > > for
> > > > > > > all
> > > > > > > the
> > > > > > > sched
> > > > > > > work/tdr work to finish, and we make sure there's no
> > > > > > > entity
> > > > > > > existing
> > > > > > > that's not yet stopped we should catch them all?
> > > > > 
> > > > > Unfortunately not.
> > > > > 
> > > > > Even when you do a dma_fence_wait() on the last submission it
> > > > > can
> > > > > still be
> > > > > that another CPU is executing the callbacks to wake up the
> > > > > scheduler work
> > > > > item and cleanup the job.
> > > > > 
> > > > > That's one of the reasons why I think the design of keeping
> > > > > the
> > > > > job
> > > > > alive is
> > > > > so extremely awkward. The dma_fence as representation of the
> > > > > hw
> > > > > submission
> > > > > has a much better defined state machine and lifetime.
> > > > > 
> > > > > Regards,
> > > > > Christian.
> > > > > 
> > > > > > >   Or at least I think
> > > > > > > it's
> > > > > > > a bug if any other code even tries to touch the hw fence.
> > > > > > > 
> > > > > > > If you have any other driver code which relies on the rcu
> > > > > > > freeing
> > > > > > > then I
> > > > > > > think that's just a separate concern and we can ignore
> > > > > > > that
> > > > > > > here
> > > > > > > since the
> > > > > > > fences themselves will till get rcu-delay freed even if
> > > > > > > drm_sched_fini has
> > > > > > > finished.
> > > > > > > -Sima
> > > > > > > 
> > > > > > > > Regards,
> > > > > > > > Christian.
> > > > > > > > 
> > > > > > > > > P.
> > > > > > > > > 
> > > > > > > > > > Thanks,
> > > > > > > > > > Christian.
> > > > > > > > > > 
> > > > > > > > > > > P.
> > > > > > > > > > > 
> > > > > > > > > > > >      drm_sched_wqueue_stop(sched);
> > > > > > > > > > > > + /*
> > > > > > > > > > > > + * Tearing down the scheduler wile there are
> > > > > > > > > > > > still
> > > > > > > > > > > > unprocessed jobs can
> > > > > > > > > > > > + * lead to use after free issues in the
> > > > > > > > > > > > scheduler
> > > > > > > > > > > > fence.
> > > > > > > > > > > > + */
> > > > > > > > > > > > + WARN_ON(!list_empty(&sched->pending_list));
> > > > > > > > > > > > +
> > > > > > > > > > > >      for (i = DRM_SCHED_PRIORITY_KERNEL; i <
> > > > > > > > > > > > sched-
> > > > > > > > > > > > > num_rqs;
> > > > > > > > > > > > i++)
> > > > > > > > > > > > {
> > > > > > > > > > > >      struct drm_sched_rq *rq = sched-
> > > > > > > > > > > > > sched_rq[i];
> > > > > > > > > > > >      spin_lock(&rq->lock);
> > > > > > > > > > > > - list_for_each_entry(s_entity, &rq-
> > > > > > > > > > > > > entities,
> > > > > > > > > > > > list)
> > > > > > > > > > > > + list_for_each_entry(s_entity, &rq-
> > > > > > > > > > > > > entities,
> > > > > > > > > > > > list) {
> > > > > > > > > > > > + /*
> > > > > > > > > > > > + * The justification for this
> > > > > > > > > > > > BUG_ON()
> > > > > > > > > > > > is
> > > > > > > > > > > > that tearing
> > > > > > > > > > > > + * down the scheduler while jobs
> > > > > > > > > > > > are
> > > > > > > > > > > > pending
> > > > > > > > > > > > leaves
> > > > > > > > > > > > + * dma_fences unsignaled. Since we
> > > > > > > > > > > > have
> > > > > > > > > > > > dependencies
> > > > > > > > > > > > + * from the core memory management
> > > > > > > > > > > > to
> > > > > > > > > > > > eventually signal
> > > > > > > > > > > > + * dma_fences this can trivially
> > > > > > > > > > > > lead to
> > > > > > > > > > > > a
> > > > > > > > > > > > system wide
> > > > > > > > > > > > + * stop because of a locked up
> > > > > > > > > > > > memory
> > > > > > > > > > > > management.
> > > > > > > > > > > > + */
> > > > > > > > > > > > + BUG_ON(spsc_queue_count(&s_entity-
> > > > > > > > > > > > > job_queue));
> > > > > > > > > > > > +
> > > > > > > > > > > >      /*
> > > > > > > > > > > >      * Prevents reinsertion and marks
> > > > > > > > > > > > job_queue
> > > > > > > > > > > > as idle,
> > > > > > > > > > > >      * it will removed from rq in
> > > > > > > > > > > > drm_sched_entity_fini
> > > > > > > > > > > >      * eventually
> > > > > > > > > > > >      */
> > > > > > > > > > > >      s_entity->stopped = true;
> > > > > > > > > > > > + }
> > > > > > > > > > > >      spin_unlock(&rq->lock);
> > > > > > > > > > > >      kfree(sched->sched_rq[i]);
> > > > > > > > > > > >      }
> > > > 
> > > 
> > 
>