Re: [PATCH 1/2] drm/sched: add WARN_ON and BUG_ON to drm_sched_fini

Philipp Stanner <pstanner@xxxxxxxxxx> · Fri, 15 Nov 2024 14:55:33 +0100

On Fri, 2024-11-15 at 12:26 +0100, Christian König wrote:
> Interesting, those mails just showed up in my inbox as unread. More
> than 
> 14 days after you send it.
> 
> So sorry for the late reply.

I smell google mail :p

> 
> Am 29.10.24 um 08:22 schrieb Philipp Stanner:
> > Christian, Sima?
> > 
> > Matthew? (+CC)
> > 
> > Opinions on the below?
> > 
> > tl;dr:
> > I still think it's a good thing to detectably block in
> > drm_sched_fini(), or at the very least drm_sched_flush(), because
> > then
> > you'll find out that the driver is broken and can repair it.
> 
> As discussed in private mail as well, the third option not mentioned
> so 
> far is to completely drop the free_job callback.

If it's really possible to do that I think this sounds like a great
solution at first glance because we could decrease complexity quite
substantially.

> 
> A good bunch of the problems we have here are caused by abusing the
> job 
> as state for executing the submission on the hardware.
> 
> The original design idea of the scheduler instead was to only have
> the 
> job as intermediate state between submission and picking what to
> execute 
> next. E.g. the scheduler in that view is just a pipeline were you
> push 
> jobs in on one end and jobs come out on the other end.

So let's get a bit more precise about this:
   1. Driver enqueues with drm_sched_job_arm()
   2. job ends up in pending_list
   3. Sooner or later scheduler calls run_job()
   4. Job is pushed to hardware
   5. Fence gets signaled
   6. ???

What would the "come out on the other end" part you describe look like?

How would jobs get removed from pending_list and, accordingly, how
would we avoid leaks?

> 
> In that approach the object which represents the hardware execution
> is 
> the dma_fence instead of the job. And the dma_fence has a well
> defined 
> lifetime, error handling, etc...
> 
> So when we go back to this original approach it would mean that we
> don't 
> need to wait for any cleanup because the scheduler wouldn't be
> involved 
> in the execution of the jobs on the hardware any more.

It would be involved (to speak precisely) in the sense of the scheduler
still being the one who pushes the job onto the hardware, agreed?

> 
> The worst thing that could happen is that the driver messes things up
> and still has not executed job in an entity,

I can't fully follow.

So in your mind, the driver would personally set the dma_fence callback
and hereby be informed about the job being completed, right?

But you wouldn't want to aim for getting rid of struct drm_sched_job
completely, or would you?

Grüße,
P.

> but in that case we could
> just warn, block for the hardware fence and then kill all pending 
> submissions with an error.- And this blocking on the hardware fence 
> can't deadlock because we know that it is a hardware fence with
> certain 
> properties and not some random free_job callback where we don't have
> any 
> idea what locks the driver need.
> 
> Regards,
> Christian.
> 
> > 
> > P.
> > 
> > 
> > On Fri, 2024-10-18 at 14:07 +0200, Philipp Stanner wrote:
> > > On Mon, 2024-10-14 at 16:56 +0200, Danilo Krummrich wrote:
> > > > On Fri, Sep 27, 2024 at 11:04:48AM +0200, Christian König
> > > > wrote:
> > > > > Am 25.09.24 um 16:53 schrieb Philipp Stanner:
> > > > > > On Tue, 2024-09-24 at 13:18 +0200, Simona Vetter wrote:
> > > > > > > On Mon, Sep 23, 2024 at 05:24:10PM +0200, Christian König
> > > > > > > wrote:
> > > > > > > > Am 20.09.24 um 15:26 schrieb Philipp Stanner:
> > > > > > > > > On Fri, 2024-09-20 at 12:33 +0200, Christian König
> > > > > > > > > wrote:
> > > > > > > > > > Am 20.09.24 um 10:57 schrieb Philipp Stanner:
> > > > > > > > > > > On Wed, 2024-09-18 at 15:39 +0200, Christian
> > > > > > > > > > > König
> > > > > > > > > > > wrote:
> > > > > > > > > > > > Tearing down the scheduler with jobs still on
> > > > > > > > > > > > the
> > > > > > > > > > > > pending
> > > > > > > > > > > > list
> > > > > > > > > > > > can
> > > > > > > > > > > > lead to use after free issues. Add a warning if
> > > > > > > > > > > > drivers try
> > > > > > > > > > > > to
> > > > > > > > > > > > destroy a scheduler which still has work pushed
> > > > > > > > > > > > to
> > > > > > > > > > > > the HW.
> > > > > > > > > > > Did you have time yet to look into my proposed
> > > > > > > > > > > waitque-
> > > > > > > > > > > solution?
> > > > > > > > > > I don't remember seeing anything. What have I
> > > > > > > > > > missed?
> > > > > > > > > https://lore.kernel.org/all/20240903094446.29797-2-pstanner@xxxxxxxxxx/
> > > > > > > > Mhm, I didn't got that in my inbox for some reason.
> > > > > > > > 
> > > > > > > > Interesting approach, I'm just not sure if we can or
> > > > > > > > should
> > > > > > > > wait in
> > > > > > > > drm_sched_fini().
> > > > > > We do agree that jobs still pending when drm_sched_fini()
> > > > > > starts
> > > > > > is
> > > > > > always a bug, right?
> > > > > Correct, the question is how to avoid that.
> > > > > 
> > > > > > If so, what are the disadvantages of waiting in
> > > > > > drm_sched_fini()?
> > > > > > We
> > > > > > could block buggy drivers as I see it. Which wouldn't be
> > > > > > good,
> > > > > > but
> > > > > > could then be fixed on drivers' site.
> > > > > Sima explained that pretty well: Don't block in fops->close,
> > > > > do
> > > > > that in
> > > > > fops->flush instead.
> > > > I agree that we shouldn't block in close(), but this
> > > > effectively
> > > > means that we
> > > > need to reference count the scheduler, right?
> > > > 
> > > > Otherwise, if we allow to just skip / interrupt the teardown,
> > > > we
> > > > can
> > > > still leak
> > > > memory.
> > > Having thought about it, I agree with Danilo. Having something
> > > that
> > > shall wait on green light, but can be interrupted, is no
> > > guarantee
> > > and
> > > therefore not a feasible solution.
> > > 
> > > To break down the solution space, these seem to be our options:
> > >     1. We have something (either drm_sched_fini() or a helper,
> > > e.g.,
> > >        drm_sched_flush()) that definitely blocks until the
> > > pending
> > > list
> > >        has become empty.
> > >     2. We have jobs reference-count the scheduler, so the latter
> > > can
> > >        outlive the driver and will be freed some time later.
> > > 
> > > Can anyone think of a third solution?
> > > 
> > > 
> > > Solution #1 has the problem of obviously blocking unconditionally
> > > if
> > > the driver didn't make sure that all fences will be signaled
> > > within
> > > reasonable time. In my opinion, this would actually be an
> > > advantage,
> > > because it will be *very* noticable and force users to repair
> > > their
> > > driver. The driver *has* to guarantee that all fences will be
> > > signaled.
> > > If the driver has to do fishy things, having the blocking
> > > outsourced
> > > to
> > > the helper drm_sched_flush() would allow them to circumvent that.
> > > 
> > > Solution #2 has the problem of backend_ops.free_job() potentially
> > > using
> > > driver-data after the driver is gone, causing UAF. So with this
> > > solutions all drivers would have to be aware of the issue and
> > > handle
> > > it
> > > through one of DRMs primitives dedicated to such problems.
> > > 
> > > 
> > > Currently, all drivers either work around the problem internally
> > > or
> > > simply ignore it, it seems.
> > > 
> > > So I'd argue that both solutions are an improvement over the
> > > existing
> > > situation. My preference would be #1.
> > > 
> > > 
> > > Opinions?
> > > 
> > > P.
> > > 
> > > > > One issue this solves is that when you send a SIGTERM the
> > > > > tear
> > > > > down
> > > > > handling
> > > > > first flushes all the FDs and then closes them.
> > > > > 
> > > > > So if flushing the FDs blocks because the process initiated
> > > > > sending
> > > > > a
> > > > > terabyte of data over a 300bps line (for example) you can
> > > > > still
> > > > > throw a
> > > > > SIGKILL and abort that as well.
> > > > > 
> > > > > If you would block in fops-close() that SIGKILL won't have
> > > > > any
> > > > > effect any
> > > > > more because by the time close() is called the process is
> > > > > gone
> > > > > and
> > > > > signals
> > > > > are already blocked.
> > > > > 
> > > > > And yes when I learned about that issue I was also buffed
> > > > > that
> > > > > handling like
> > > > > this in the UNIX design is nearly 50 years old and still
> > > > > applies
> > > > > to
> > > > > today.
> > > > > > > > Probably better to make that a separate function,
> > > > > > > > something
> > > > > > > > like
> > > > > > > > drm_sched_flush() or similar.
> > > > > > We could do that. Such a function could then be called by
> > > > > > drivers
> > > > > > which
> > > > > > are not sure whether all jobs are done before they start
> > > > > > tearing
> > > > > > down.
> > > > > Yes exactly that's the idea. And give that flush function a
> > > > > return
> > > > > code so
> > > > > that it can return -EINTR.
> > > > > 
> > > > > > > Yeah I don't think we should smash this into
> > > > > > > drm_sched_fini
> > > > > > > unconditionally. I think conceptually there's about three
> > > > > > > cases:
> > > > > > > 
> > > > > > > - Ringbuffer schedules. Probably want everything as-is,
> > > > > > > because
> > > > > > >     drm_sched_fini is called long after all the entities
> > > > > > > are
> > > > > > > gone in
> > > > > > >     drm_device cleanup.
> > > > > > > 
> > > > > > > - fw scheduler hardware with preemption support. There we
> > > > > > > probably
> > > > > > > want to
> > > > > > >     nuke the context by setting the tdr timeout to zero
> > > > > > > (or
> > > > > > > maybe just
> > > > > > > as
> > > > > > >     long as context preemption takes to be efficient),
> > > > > > > and
> > > > > > > relying on
> > > > > > > the
> > > > > > >     normal gpu reset flow to handle things.
> > > > > > > drm_sched_entity_flush
> > > > > > > kinda
> > > > > > >     does this, except not really and it's a lot more
> > > > > > > focused
> > > > > > > on
> > > > > > > the
> > > > > > >     ringbuffer context. So maybe we want a new
> > > > > > > drm_sched_entity_kill.
> > > > > > > 
> > > > > > >     For this case calling drm_sched_fini() after the 1:1
> > > > > > > entity
> > > > > > > is gone
> > > > > > >     should not find any linger jobs, it would actually be
> > > > > > > a
> > > > > > > bug
> > > > > > > somewhere if
> > > > > > >     there's a job lingering. Maybe a sanity check that
> > > > > > > there's
> > > > > > > not just
> > > > > > > no
> > > > > > >     jobs lingering, but also no entity left would be good
> > > > > > > here?
> > > > > > The check for lingering ones is in Christian's patch here
> > > > > > IISC.
> > > > > > At which position would you imagine the check for the
> > > > > > entity
> > > > > > being
> > > > > > performed?
> > > > > > 
> > > > > > > - fw scheduler without preemption support. There we kinda
> > > > > > > need
> > > > > > > the
> > > > > > >     drm_sched_flush, except blocking in fops->close is
> > > > > > > not
> > > > > > > great. So
> > > > > > > instead
> > > > > > >     I think the following is better:
> > > > > > >     1. drm_sched_entity_stopped, which only stops new
> > > > > > > submissions (for
> > > > > > >     paranoia) but doesn't tear down the entity
> > > > > > Who would call that function?
> > > > > > Drivers using it voluntarily could just as well stop
> > > > > > accepting
> > > > > > new jobs
> > > > > > from userspace to their entities, couldn't they?
> > > > > > 
> > > > > > >     2. drm_dev_get
> > > > > > >     3. launch a worker which does a) drm_sched_flush (or
> > > > > > >     drm_sched_entity_flush or whatever we call it) b)
> > > > > > > drm_sched_entity_fini
> > > > > > >     + drm_sched_fini c) drm_dev_put
> > > > > > > 
> > > > > > >     Note that semantically this implements the refcount
> > > > > > > in the
> > > > > > > other
> > > > > > > path
> > > > > > >     from Phillip:
> > > > > > > 
> > > > > > > https://lore.kernel.org/all/20240903094531.29893-2-pstanner@xxxxxxxxxx/
> > > > > > >     Except it doesn't impose refcount on everyone else
> > > > > > > who
> > > > > > > doesn't need
> > > > > > > it,
> > > > > > >     and it doesn't even impose refcounting on drivers
> > > > > > > that do
> > > > > > > need it
> > > > > > >     because we use drm_sched_flush and a worker to
> > > > > > > achieve the
> > > > > > > same.
> > > > > > I indeed wasn't happy with the refcount approach for that
> > > > > > reason,
> > > > > > agreed.
> > > > > > 
> > > > > > > Essentially helper functions for the common use-cases
> > > > > > > instead
> > > > > > > of
> > > > > > > trying to
> > > > > > > solve them all by putting drm_sched_flush as a
> > > > > > > potentially
> > > > > > > very
> > > > > > > blocking
> > > > > > > function into drm_sched_fini.
> > > > > > I'm still not able to see why it blocking would be
> > > > > > undesired –
> > > > > > as
> > > > > > far
> > > > > > as I can see, it is only invoked on driver teardown, so not
> > > > > > during
> > > > > > active operation. Teardown doesn't happen that often, and
> > > > > > it
> > > > > > can
> > > > > > (if
> > > > > > implemented correctly) only block until the driver's code
> > > > > > has
> > > > > > signaled
> > > > > > the last fences. If that doesn't happen, the block would
> > > > > > reveal
> > > > > > a
> > > > > > bug.
> > > > > > 
> > > > > > But don't get me wrong: I don't want to *push* this
> > > > > > solution. I
> > > > > > just
> > > > > > want to understand when it could become a problem.
> > > > > > 
> > > > > > 
> > > > > > Wouldn't an explicitly blocking, separate function like
> > > > > > drm_sched_flush() or drm_sched_fini_flush() be a small,
> > > > > > doable
> > > > > > step
> > > > > > towards the right direction?
> > > > > I think that this is the right thing to do, yes.
> > > > > 
> > > > > > > > > > > > When there are still entities with jobs the
> > > > > > > > > > > > situation
> > > > > > > > > > > > is
> > > > > > > > > > > > even
> > > > > > > > > > > > worse
> > > > > > > > > > > > since the dma_fences for those jobs can never
> > > > > > > > > > > > signal
> > > > > > > > > > > > we can
> > > > > > > > > > > > just
> > > > > > > > > > > > choose between potentially locking up core
> > > > > > > > > > > > memory
> > > > > > > > > > > > management and
> > > > > > > > > > > > random memory corruption. When drivers really
> > > > > > > > > > > > mess
> > > > > > > > > > > > it
> > > > > > > > > > > > up
> > > > > > > > > > > > that
> > > > > > > > > > > > well
> > > > > > > > > > > > let them run into a BUG_ON().
> > > > > > > > > > > > 
> > > > > > > > > > > > Signed-off-by: Christian
> > > > > > > > > > > > König<christian.koenig@xxxxxxx>
> > > > > > > > > > > > ---
> > > > > > > > > > > >      drivers/gpu/drm/scheduler/sched_main.c |
> > > > > > > > > > > > 19
> > > > > > > > > > > > ++++++++++++++++++-
> > > > > > > > > > > >      1 file changed, 18 insertions(+), 1
> > > > > > > > > > > > deletion(-)
> > > > > > > > > > > > 
> > > > > > > > > > > > diff --git
> > > > > > > > > > > > a/drivers/gpu/drm/scheduler/sched_main.c
> > > > > > > > > > > > b/drivers/gpu/drm/scheduler/sched_main.c
> > > > > > > > > > > > index f093616fe53c..8a46fab5cdc8 100644
> > > > > > > > > > > > --- a/drivers/gpu/drm/scheduler/sched_main.c
> > > > > > > > > > > > +++ b/drivers/gpu/drm/scheduler/sched_main.c
> > > > > > > > > > > > @@ -1333,17 +1333,34 @@ void
> > > > > > > > > > > > drm_sched_fini(struct
> > > > > > > > > > > > drm_gpu_scheduler
> > > > > > > > > > > > *sched)
> > > > > > > > > > > I agree with Sima that it should first be
> > > > > > > > > > > documented
> > > > > > > > > > > in
> > > > > > > > > > > the
> > > > > > > > > > > function's
> > > > > > > > > > > docstring what the user is expected to have done
> > > > > > > > > > > before
> > > > > > > > > > > calling the
> > > > > > > > > > > function.
> > > > > > > > > > Good point, going to update the documentation as
> > > > > > > > > > well.
> > > > > > > > > Cool thing, thx.
> > > > > > > > > Would be great if everything (not totally trivial)
> > > > > > > > > necessary to
> > > > > > > > > be done
> > > > > > > > > before _fini() is mentioned.
> > > > > > > > > 
> > > > > > > > > One could also think about providing a hint at how
> > > > > > > > > the
> > > > > > > > > driver can
> > > > > > > > > do
> > > > > > > > > that. AFAICS the only way for the driver to ensure
> > > > > > > > > that
> > > > > > > > > is
> > > > > > > > > to
> > > > > > > > > maintain
> > > > > > > > > its own, separate list of submitted jobs.
> > > > > > > > Even with a duplicated pending list it's actually
> > > > > > > > currently
> > > > > > > > impossible to do
> > > > > > > > this fully cleanly.
> > > > > > > > 
> > > > > > > > The problem is that the dma_fence object gives no
> > > > > > > > guarantee
> > > > > > > > when
> > > > > > > > callbacks
> > > > > > > > are processed, e.g. they can be both processed from
> > > > > > > > interrupt
> > > > > > > > context as
> > > > > > > > well as from a CPU which called
> > > > > > > > dma_fence_is_signaled().
> > > > > > > > 
> > > > > > > > So when a driver (or drm_sched_fini) waits for the last
> > > > > > > > submitted
> > > > > > > > fence it
> > > > > > > > actually can be that the drm_sched object still needs
> > > > > > > > to do
> > > > > > > > some
> > > > > > > > processing.
> > > > > > > > See the hack in amdgpu_vm_tlb_seq() for more background
> > > > > > > > on
> > > > > > > > the
> > > > > > > > problem.
> > > > > > Oh dear ^^'
> > > > > > We better work towards fixing that centrally
> > > > > > 
> > > > > > Thanks,
> > > > > > P.
> > > > > > 
> > > > > > 
> > > > > > > So I thought this should be fairly easy because of the
> > > > > > > sched
> > > > > > > hw/public
> > > > > > > fence split: If we wait for both all jobs to finish and
> > > > > > > for
> > > > > > > all
> > > > > > > the
> > > > > > > sched
> > > > > > > work/tdr work to finish, and we make sure there's no
> > > > > > > entity
> > > > > > > existing
> > > > > > > that's not yet stopped we should catch them all?
> > > > > Unfortunately not.
> > > > > 
> > > > > Even when you do a dma_fence_wait() on the last submission it
> > > > > can
> > > > > still be
> > > > > that another CPU is executing the callbacks to wake up the
> > > > > scheduler work
> > > > > item and cleanup the job.
> > > > > 
> > > > > That's one of the reasons why I think the design of keeping
> > > > > the
> > > > > job
> > > > > alive is
> > > > > so extremely awkward. The dma_fence as representation of the
> > > > > hw
> > > > > submission
> > > > > has a much better defined state machine and lifetime.
> > > > > 
> > > > > Regards,
> > > > > Christian.
> > > > > 
> > > > > > >    Or at least I think
> > > > > > > it's
> > > > > > > a bug if any other code even tries to touch the hw fence.
> > > > > > > 
> > > > > > > If you have any other driver code which relies on the rcu
> > > > > > > freeing
> > > > > > > then I
> > > > > > > think that's just a separate concern and we can ignore
> > > > > > > that
> > > > > > > here
> > > > > > > since the
> > > > > > > fences themselves will till get rcu-delay freed even if
> > > > > > > drm_sched_fini has
> > > > > > > finished.
> > > > > > > -Sima
> > > > > > > 
> > > > > > > > Regards,
> > > > > > > > Christian.
> > > > > > > > 
> > > > > > > > > P.
> > > > > > > > > 
> > > > > > > > > > Thanks,
> > > > > > > > > > Christian.
> > > > > > > > > > 
> > > > > > > > > > > P.
> > > > > > > > > > > 
> > > > > > > > > > > >       drm_sched_wqueue_stop(sched);
> > > > > > > > > > > > + /*
> > > > > > > > > > > > + * Tearing down the scheduler wile there are
> > > > > > > > > > > > still
> > > > > > > > > > > > unprocessed jobs can
> > > > > > > > > > > > + * lead to use after free issues in the
> > > > > > > > > > > > scheduler
> > > > > > > > > > > > fence.
> > > > > > > > > > > > + */
> > > > > > > > > > > > + WARN_ON(!list_empty(&sched->pending_list));
> > > > > > > > > > > > +
> > > > > > > > > > > >       for (i = DRM_SCHED_PRIORITY_KERNEL; i <
> > > > > > > > > > > > sched-
> > > > > > > > > > > > > num_rqs;
> > > > > > > > > > > > i++)
> > > > > > > > > > > > {
> > > > > > > > > > > >       struct drm_sched_rq *rq = sched-
> > > > > > > > > > > > > sched_rq[i];
> > > > > > > > > > > >       spin_lock(&rq->lock);
> > > > > > > > > > > > - list_for_each_entry(s_entity, &rq-
> > > > > > > > > > > > > entities,
> > > > > > > > > > > > list)
> > > > > > > > > > > > + list_for_each_entry(s_entity, &rq-
> > > > > > > > > > > > > entities,
> > > > > > > > > > > > list) {
> > > > > > > > > > > > + /*
> > > > > > > > > > > > + * The justification for this
> > > > > > > > > > > > BUG_ON()
> > > > > > > > > > > > is
> > > > > > > > > > > > that tearing
> > > > > > > > > > > > + * down the scheduler while jobs
> > > > > > > > > > > > are
> > > > > > > > > > > > pending
> > > > > > > > > > > > leaves
> > > > > > > > > > > > + * dma_fences unsignaled. Since we
> > > > > > > > > > > > have
> > > > > > > > > > > > dependencies
> > > > > > > > > > > > + * from the core memory management
> > > > > > > > > > > > to
> > > > > > > > > > > > eventually signal
> > > > > > > > > > > > + * dma_fences this can trivially
> > > > > > > > > > > > lead to
> > > > > > > > > > > > a
> > > > > > > > > > > > system wide
> > > > > > > > > > > > + * stop because of a locked up
> > > > > > > > > > > > memory
> > > > > > > > > > > > management.
> > > > > > > > > > > > + */
> > > > > > > > > > > > + BUG_ON(spsc_queue_count(&s_entity-
> > > > > > > > > > > > > job_queue));
> > > > > > > > > > > > +
> > > > > > > > > > > >       /*
> > > > > > > > > > > >       * Prevents reinsertion and marks
> > > > > > > > > > > > job_queue
> > > > > > > > > > > > as idle,
> > > > > > > > > > > >       * it will removed from rq in
> > > > > > > > > > > > drm_sched_entity_fini
> > > > > > > > > > > >       * eventually
> > > > > > > > > > > >       */
> > > > > > > > > > > >       s_entity->stopped = true;
> > > > > > > > > > > > + }
> > > > > > > > > > > >       spin_unlock(&rq->lock);
> > > > > > > > > > > >       kfree(sched->sched_rq[i]);
> > > > > > > > > > > >       }
>