Re: [PATCH RFC 11/18] drm/scheduler: Clean up jobs when the scheduler is torn down

Christian König <christian.koenig@xxxxxxx> · Wed, 8 Mar 2023 19:12:00 +0100

Am 08.03.23 um 18:32 schrieb Asahi Lina:
[SNIP]
Yes but... none of this cleans up jobs that are already submitted by the
scheduler and in its pending list, with registered completion callbacks,
which were already popped off of the entities.

*That* is the problem this patch fixes!

Ah! Yes that makes more sense now.

We could add a warning when users of this API don't do this
correctly, but cleaning up incorrect API use is clearly something we
don't want here.
It is the job of the Rust abstractions to make incorrect API use that
leads to memory unsafety impossible. So even if you don't want that in
C, it's my job to do that for Rust... and right now, I just can't
because drm_sched doesn't provide an API that can be safely wrapped
without weird bits of babysitting functionality on top (like tracking
jobs outside or awkwardly making jobs hold a reference to the scheduler
and defer dropping it to another thread).

Yeah, that was discussed before but rejected.

The argument was that the upper layer needs to wait for the hw to become
idle before the scheduler can be destroyed anyway.
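
For illustration only, here is a minimal userspace sketch of what "wait for
idle before destroying the scheduler" could look like when the counting is
done outside the scheduler. Everything in it is an assumption: `InFlight`,
`Scheduler`, and `teardown_after_jobs` are hypothetical names, threads stand
in for hardware completion callbacks, and none of this is the drm_sched API.

```rust
use std::sync::{Arc, Condvar, Mutex};
use std::thread;

/// Hypothetical in-flight job counter kept by the upper layer.
/// This is NOT part of drm_sched; it only models the idea.
#[derive(Default)]
pub struct InFlight {
    count: Mutex<usize>,
    idle: Condvar,
}

impl InFlight {
    /// Called when a job is handed to the (stand-in) hardware.
    pub fn job_submitted(&self) {
        *self.count.lock().unwrap() += 1;
    }

    /// Called from the (stand-in) completion callback.
    pub fn job_completed(&self) {
        let mut n = self.count.lock().unwrap();
        *n -= 1;
        if *n == 0 {
            self.idle.notify_all();
        }
    }

    /// Blocks until no jobs are in flight.
    pub fn wait_idle(&self) {
        let mut n = self.count.lock().unwrap();
        while *n != 0 {
            n = self.idle.wait(n).unwrap();
        }
    }

    /// Number of jobs currently in flight.
    pub fn pending(&self) -> usize {
        *self.count.lock().unwrap()
    }
}

/// Stand-in for a scheduler wrapper that must not be freed while busy.
pub struct Scheduler {
    in_flight: Arc<InFlight>,
}

impl Drop for Scheduler {
    fn drop(&mut self) {
        // Blocking in the free path: wait for every job to complete.
        self.in_flight.wait_idle();
    }
}

/// Submits `jobs` jobs, completes each on its own thread, then drops the
/// scheduler. Returns the number of jobs still pending after teardown.
pub fn teardown_after_jobs(jobs: usize) -> usize {
    let in_flight = Arc::new(InFlight::default());
    let sched = Scheduler { in_flight: Arc::clone(&in_flight) };

    let mut workers = Vec::new();
    for _ in 0..jobs {
        sched.in_flight.job_submitted();
        let tracker = Arc::clone(&in_flight);
        workers.push(thread::spawn(move || tracker.job_completed()));
    }

    drop(sched); // returns only once the counter has reached zero
    for w in workers {
        w.join().unwrap();
    }
    in_flight.pending()
}
```

Calling `teardown_after_jobs(3)` drops the stand-in scheduler only after all
three completions have run. Note that this blocks in the drop path and
duplicates job accounting outside the scheduler, and it would hang if a
completion could only run on the thread that is doing the drop.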

Right now, it is not possible to create a safe Rust abstraction for
drm_sched without doing something like duplicating all job tracking in
the abstraction, or the above backreference + deferred cleanup mess, or
something equally silly. So let's just fix the C side please ^^
Nope, as far as I can see this is just not correctly tearing down the
objects in the right order.
There's no API to clean up in-flight jobs in a drm_sched at all.
Destroying an entity won't do it. So there is no reasonable way to do
this at all...

Yes, this was removed.

So you are trying to do something which is not supposed to work in the
first place.
I need to make things that aren't supposed to work impossible to do in
the first place, or at least fail gracefully instead of just oopsing
like drm_sched does today...

If you're convinced there's a way to do this, can you tell me exactly
what code sequence I need to run to safely shut down a scheduler
assuming all entities are already destroyed? You can't ask me for a list
of pending jobs (the scheduler knows this, it doesn't make any sense to
duplicate that outside), and you can't ask me to just not do this until
all jobs complete execution (because then we either end up with the
messy deadlock situation I described if I take a reference, or more
duplicative in-flight job count tracking and blocking in the free path
of the Rust abstraction, which doesn't make any sense either).

Good question. We don't have anybody upstream who uses the scheduler
lifetime like this.

Essentially the job list in the scheduler is something we wanted to 
remove because it causes tons of race conditions during hw recovery.

When you tear down the firmware queue, how do you handle already
submitted jobs there?

Regards,
Christian.

~~ Lina