Interesting, those mails just showed up in my inbox as unread, more than
14 days after you sent them.

So sorry for the late reply.
On 29.10.24 at 08:22, Philipp Stanner wrote:
Christian, Sima?
Matthew? (+CC)
Opinions on the below?
tl;dr:
I still think it's a good thing to detectably block in
drm_sched_fini(), or at the very least drm_sched_flush(), because then
you'll find out that the driver is broken and can repair it.
As discussed in private mail as well, the third option not mentioned so
far is to completely drop the free_job callback.
A good bunch of the problems we have here are caused by abusing the job
as state for executing the submission on the hardware.
The original design idea of the scheduler instead was to only have the
job as intermediate state between submission and picking what to execute
next. E.g. the scheduler in that view is just a pipeline where you push
jobs in on one end and jobs come out on the other end.
In that approach the object which represents the hardware execution is
the dma_fence instead of the job. And the dma_fence has a well defined
lifetime, error handling, etc...
So when we go back to this original approach it would mean that we don't
need to wait for any cleanup because the scheduler wouldn't be involved
in the execution of the jobs on the hardware any more.
The worst thing that could happen is that the driver messes things up
and still has an unexecuted job in an entity, but in that case we could
just warn, block on the hardware fence and then kill all pending
submissions with an error. And this blocking on the hardware fence
can't deadlock, because we know that it is a hardware fence with
certain properties and not some random free_job callback where we don't
have any idea what locks the driver needs.
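To sketch what that could look like (totally untested;
drm_sched_entity_last_hw_fence() and
drm_sched_entity_kill_jobs_with_error() are made-up helpers standing in
for "get the hw fence of the last real submission" and "complete all
still-queued jobs with an error"):

#include <drm/gpu_scheduler.h>
#include <linux/dma-fence.h>
#include <linux/errno.h>

static void example_entity_teardown(struct drm_sched_entity *entity)
{
	struct dma_fence *hw_fence;

	/* A well-behaved driver has drained the entity at this point. */
	WARN_ON(spsc_queue_count(&entity->job_queue));

	hw_fence = drm_sched_entity_last_hw_fence(entity);
	if (hw_fence) {
		/*
		 * Blocking here can't deadlock: a hardware fence must
		 * signal in bounded time and doesn't depend on driver
		 * locks the way a free_job callback might.
		 */
		dma_fence_wait(hw_fence, false);
		dma_fence_put(hw_fence);
	}

	/* Complete everything that never made it to the hardware. */
	drm_sched_entity_kill_jobs_with_error(entity, -ECANCELED);
}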
Regards,
Christian.
P.
On Fri, 2024-10-18 at 14:07 +0200, Philipp Stanner wrote:
On Mon, 2024-10-14 at 16:56 +0200, Danilo Krummrich wrote:
On Fri, Sep 27, 2024 at 11:04:48AM +0200, Christian König wrote:
On 25.09.24 at 16:53, Philipp Stanner wrote:
On Tue, 2024-09-24 at 13:18 +0200, Simona Vetter wrote:
On Mon, Sep 23, 2024 at 05:24:10PM +0200, Christian König wrote:
On 20.09.24 at 15:26, Philipp Stanner wrote:
On Fri, 2024-09-20 at 12:33 +0200, Christian König wrote:
On 20.09.24 at 10:57, Philipp Stanner wrote:
On Wed, 2024-09-18 at 15:39 +0200, Christian König wrote:
Tearing down the scheduler with jobs still on the pending list can
lead to use after free issues. Add a warning if drivers try to
destroy a scheduler which still has work pushed to the HW.
Did you have time yet to look into my proposed waitque solution?
I don't remember seeing anything. What have I missed?
https://lore.kernel.org/all/20240903094446.29797-2-pstanner@xxxxxxxxxx/
Mhm, I didn't get that in my inbox for some reason.
Interesting approach, I'm just not sure if we can or should wait in
drm_sched_fini().
We do agree that jobs still pending when drm_sched_fini() starts is
always a bug, right?
Correct, the question is how to avoid that.
If so, what are the disadvantages of waiting in drm_sched_fini()? We
could block buggy drivers as I see it. Which wouldn't be good, but
could then be fixed on the drivers' side.
Sima explained that pretty well: Don't block in fops->close, do that
in fops->flush instead.
I agree that we shouldn't block in close(), but this effectively means
that we need to reference count the scheduler, right?

Otherwise, if we allow just skipping / interrupting the teardown, we
can still leak memory.
Having thought about it, I agree with Danilo. Having something that
shall wait for the green light, but can be interrupted, is no guarantee
and therefore not a feasible solution.
To break down the solution space, these seem to be our options:

1. We have something (either drm_sched_fini() or a helper, e.g.,
   drm_sched_flush()) that definitely blocks until the pending list
   has become empty.
2. We have jobs reference-count the scheduler, so the latter can
   outlive the driver and will be freed some time later.

Can anyone think of a third solution?
Solution #1 has the problem of obviously blocking unconditionally if
the driver didn't make sure that all fences will be signaled within
reasonable time. In my opinion, this would actually be an advantage,
because it will be *very* noticeable and force users to repair their
driver. The driver *has* to guarantee that all fences will be signaled.

If the driver has to do fishy things, having the blocking outsourced
to the helper drm_sched_flush() would allow them to circumvent that.
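As a rough sketch of solution #1 (pending_list and job_list_lock exist
in today's scheduler; the job_freed waitqueue is made up and would have
to be woken by the free_job path after taking a job off the pending
list):

#include <drm/gpu_scheduler.h>
#include <linux/wait.h>

/* Peek at the pending list under its lock. */
static bool drm_sched_pending_is_empty(struct drm_gpu_scheduler *sched)
{
	bool empty;

	spin_lock(&sched->job_list_lock);
	empty = list_empty(&sched->pending_list);
	spin_unlock(&sched->job_list_lock);

	return empty;
}

/*
 * Block until all pending jobs were freed. A driver whose fences
 * never signal blocks here forever, loudly, instead of silently
 * corrupting memory later.
 */
void drm_sched_flush(struct drm_gpu_scheduler *sched)
{
	wait_event(sched->job_freed, drm_sched_pending_is_empty(sched));
}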
Solution #2 has the problem of backend_ops.free_job() potentially using
driver data after the driver is gone, causing UAF. So with this
solution all drivers would have to be aware of the issue and handle it
through one of DRM's primitives dedicated to such problems.

Currently, all drivers either work around the problem internally or
simply ignore it, it seems.
So I'd argue that both solutions are an improvement over the existing
situation. My preference would be #1.
Opinions?
P.
One issue this solves is that when you send a SIGTERM the teardown
handling first flushes all the FDs and then closes them.

So if flushing the FDs blocks because the process initiated sending a
terabyte of data over a 300bps line (for example) you can still throw
a SIGKILL and abort that as well.
If you would block in fops->close() that SIGKILL won't have any effect
anymore, because by the time close() is called the process is gone and
signals are already blocked.
And yes, when I learned about that issue I was also baffled that this
handling in the UNIX design is nearly 50 years old and still applies
today.
Probably better to make that a separate function, something like
drm_sched_flush() or similar.
We could do that. Such a function could then be called by drivers
which are not sure whether all jobs are done before they start tearing
down.
Yes exactly, that's the idea. And give that flush function a return
code so that it can return -EINTR.
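Sketched with the same made-up job_freed waitqueue and helper as the
flush sketch above, the return-code variant could be as simple as:

/*
 * Interruptible variant: wait_event_interruptible() returns
 * -ERESTARTSYS when the waiting process is signaled, which the caller
 * (e.g. fops->flush) can map to -EINTR so that a SIGKILL'd process
 * isn't stuck on a broken driver.
 */
int drm_sched_flush(struct drm_gpu_scheduler *sched)
{
	return wait_event_interruptible(sched->job_freed,
					drm_sched_pending_is_empty(sched));
}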
Yeah, I don't think we should smash this into drm_sched_fini
unconditionally. I think conceptually there's about three cases:

- Ringbuffer schedulers. Probably want everything as-is, because
  drm_sched_fini is called long after all the entities are gone in
  drm_device cleanup.
- fw scheduler hardware with preemption support. There we probably
  want to nuke the context by setting the tdr timeout to zero (or
  maybe just as long as context preemption takes to be efficient),
  and rely on the normal gpu reset flow to handle things.
  drm_sched_entity_flush kinda does this, except not really, and it's
  a lot more focused on the ringbuffer context. So maybe we want a
  new drm_sched_entity_kill.

  For this case calling drm_sched_fini() after the 1:1 entity is gone
  should not find any lingering jobs; it would actually be a bug
  somewhere if there's a job lingering. Maybe a sanity check that
  there's not just no jobs lingering, but also no entity left would
  be good here?
The check for lingering ones is in Christian's patch here, IIUC.
At which position would you imagine the check for the entity being
performed?
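(For illustration, such an entity check could sit next to the
pending_list warning in the patch below, reusing drm_sched_fini()'s
existing loop variables; sketch only:)

	/* After all entities were torn down correctly, no rq should
	 * have an entity on its list anymore.
	 */
	for (i = DRM_SCHED_PRIORITY_KERNEL; i < sched->num_rqs; i++) {
		struct drm_sched_rq *rq = sched->sched_rq[i];

		spin_lock(&rq->lock);
		WARN_ON(!list_empty(&rq->entities));
		spin_unlock(&rq->lock);
	}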
- fw scheduler without preemption support. There we kinda need the
  drm_sched_flush, except blocking in fops->close is not great. So
  instead I think the following is better:

  1. drm_sched_entity_stopped, which only stops new submissions (for
     paranoia) but doesn't tear down the entity

Who would call that function? Drivers using it voluntarily could just
as well stop accepting new jobs from userspace to their entities,
couldn't they?
  2. drm_dev_get
  3. launch a worker which does a) drm_sched_flush (or
     drm_sched_entity_flush or whatever we call it), b)
     drm_sched_entity_fini + drm_sched_fini, c) drm_dev_put (see the
     sketch further below)
Note that semantically this implements the refcount in the other path
from Philipp:

https://lore.kernel.org/all/20240903094531.29893-2-pstanner@xxxxxxxxxx/

Except it doesn't impose refcount on everyone else who doesn't need
it, and it doesn't even impose refcounting on drivers that do need it,
because we use drm_sched_flush and a worker to achieve the same.
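A rough sketch of steps 1.-3. (my_device, teardown_work and
drm_sched_entity_stopped() are hypothetical driver-side names, and
drm_sched_flush() is the not-yet-existing blocking flush discussed
above):

#include <drm/drm_drv.h>
#include <drm/gpu_scheduler.h>
#include <linux/workqueue.h>

struct my_device {
	struct drm_device drm;
	struct drm_gpu_scheduler sched;
	struct drm_sched_entity entity;
	struct work_struct teardown_work;	/* INIT_WORK'd at device init */
};

static void my_teardown_work(struct work_struct *work)
{
	struct my_device *mdev = container_of(work, struct my_device,
					      teardown_work);

	drm_sched_flush(&mdev->sched);		/* a) wait for pending jobs */
	drm_sched_entity_fini(&mdev->entity);	/* b) entity ... */
	drm_sched_fini(&mdev->sched);		/* ... and scheduler */
	drm_dev_put(&mdev->drm);		/* c) drop the ref from 2. */
}

static void my_close(struct my_device *mdev)
{
	drm_sched_entity_stopped(&mdev->entity);  /* 1. stop new submissions */
	drm_dev_get(&mdev->drm);		  /* 2. keep the device alive */
	schedule_work(&mdev->teardown_work);	  /* 3. async flush + fini */
}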
I indeed wasn't happy with the refcount approach for that reason,
agreed.
Essentially helper functions for the common use-cases instead of
trying to solve them all by putting drm_sched_flush as a potentially
very blocking function into drm_sched_fini.
I'm still not able to see why its blocking would be undesired: as far
as I can see, it is only invoked on driver teardown, so not during
active operation. Teardown doesn't happen that often, and it can (if
implemented correctly) only block until the driver's code has signaled
the last fences. If that doesn't happen, the block would reveal a bug.

But don't get me wrong: I don't want to *push* this solution. I just
want to understand when it could become a problem.
Wouldn't an explicitly blocking, separate function like
drm_sched_flush() or drm_sched_fini_flush() be a small, doable step in
the right direction?
I think that this is the right thing to do, yes.
When there are still entities with jobs the situation is even worse,
since the dma_fences for those jobs can never signal; we can just
choose between potentially locking up core memory management and
random memory corruption. When drivers really mess it up that well,
let them run into a BUG_ON().
Signed-off-by: Christian König <christian.koenig@xxxxxxx>
---
 drivers/gpu/drm/scheduler/sched_main.c | 19 ++++++++++++++++++-
 1 file changed, 18 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
index f093616fe53c..8a46fab5cdc8 100644
--- a/drivers/gpu/drm/scheduler/sched_main.c
+++ b/drivers/gpu/drm/scheduler/sched_main.c
@@ -1333,17 +1333,34 @@ void drm_sched_fini(struct drm_gpu_scheduler *sched)
I agree with Sima that it should first be documented in the function's
docstring what the user is expected to have done before calling the
function.

Good point, going to update the documentation as well.

Cool thing, thx.

Would be great if everything (not totally trivial) necessary to be
done before _fini() is mentioned.
One could also think about providing a hint at how the driver can do
that. AFAICS the only way for the driver to ensure that is to maintain
its own, separate list of submitted jobs.
Even with a duplicated pending list it's actually currently impossible
to do this fully cleanly.

The problem is that the dma_fence object gives no guarantee when
callbacks are processed; e.g. they can be processed both from
interrupt context and from a CPU which called dma_fence_is_signaled().

So when a driver (or drm_sched_fini) waits for the last submitted
fence, it can actually be that the drm_sched object still needs to do
some processing. See the hack in amdgpu_vm_tlb_seq() for more
background on the problem.
Oh dear ^^'
We better work towards fixing that centrally.
Thanks,
P.
So I thought this should be fairly easy because of the sched hw/public
fence split: If we wait both for all jobs to finish and for all the
sched work/tdr work to finish, and we make sure there's no entity
existing that's not yet stopped, we should catch them all?
Unfortunately not.

Even when you do a dma_fence_wait() on the last submission it can
still be that another CPU is executing the callbacks to wake up the
scheduler work item and clean up the job.

That's one of the reasons why I think the design of keeping the job
alive is so extremely awkward. The dma_fence as representation of the
hw submission has a much better defined state machine and lifetime.
Regards,
Christian.
Or at least I think it's a bug if any other code even tries to touch
the hw fence.

If you have any other driver code which relies on the RCU freeing,
then I think that's just a separate concern and we can ignore that
here, since the fences themselves will still get RCU-delayed freed
even if drm_sched_fini has finished.
-Sima
Regards,
Christian.
P.
Thanks,
Christian.
P.
 	drm_sched_wqueue_stop(sched);
 
+	/*
+	 * Tearing down the scheduler while there are still unprocessed jobs
+	 * can lead to use after free issues in the scheduler fence.
+	 */
+	WARN_ON(!list_empty(&sched->pending_list));
+
 	for (i = DRM_SCHED_PRIORITY_KERNEL; i < sched->num_rqs; i++) {
 		struct drm_sched_rq *rq = sched->sched_rq[i];
 
 		spin_lock(&rq->lock);
-		list_for_each_entry(s_entity, &rq->entities, list)
+		list_for_each_entry(s_entity, &rq->entities, list) {
+			/*
+			 * The justification for this BUG_ON() is that tearing
+			 * down the scheduler while jobs are pending leaves
+			 * dma_fences unsignaled. Since we have dependencies
+			 * from the core memory management to eventually signal
+			 * dma_fences this can trivially lead to a system wide
+			 * stop because of a locked up memory management.
+			 */
+			BUG_ON(spsc_queue_count(&s_entity->job_queue));
+
 			/*
 			 * Prevents reinsertion and marks job_queue as idle,
 			 * it will removed from rq in drm_sched_entity_fini
 			 * eventually
 			 */
 			s_entity->stopped = true;
+		}
 		spin_unlock(&rq->lock);
 
 		kfree(sched->sched_rq[i]);
 	}