Re: [PATCH RFC 10/18] drm/scheduler: Add can_run_job callback

Christian König <christian.koenig@xxxxxxx> · Wed, 8 Mar 2023 20:12:38 +0100

Am 08.03.23 um 20:05 schrieb Asahi Lina:
[SNIP]
Well it's not the better way, it's the only way that works.

I have to admit that my bet on your intentions was wrong, but even that
use case doesn't work correctly.

See when your callback returns false it is perfectly possible that all
hw fences are signaled between returning that information and processing it.

The result would be that the scheduler goes to sleep and never wakes up
again.
That can't happen, because it will just go into another iteration of the
drm_sched main loop since there is an entity available still.

Rather there is probably the opposite bug in this patch: the can_run_job
logic should be moved into the wait_event_interruptible() condition
check, otherwise I think it can end up busy-looping since the condition
itself can be true even when the can_run_job check blocks it.

But there is no risk of it going to sleep and never waking up because
job completions will wake up the waitqueue by definition, and that
happens after the driver-side queues are popped. If this problem could
happen, then the existing hw_submission_limit logic would be broken in
the same way. It is logically equivalent in how it works.

Basically, if properly done in wait_event_interruptible, it is exactly
the logic of that macro that prevents this race condition and makes
everything work at all. Without it, drm_sched would be completely broken.

As I said we exercised those ideas before and yes this approach here
came up before as well and no it doesn't work.
It can never deadlock with this patch as it stands (though it could busy
loop), and if properly moved into the wait_event_interruptible(), it
would also never busy loop and work entirely as intended. The actual API
change is sound.

I don't know why you're trying so hard to convince everyone that this
approach is fundamentally broken... It might be a bad idea for other
reasons, it might encourage incorrect usage, it might not be the best
option, there are plenty of arguments you can make... but you just keep
trying to make an argument that it just can't work at all for some
reason. Why? I already said I'm happy dropping it in favor of the fences...

Well because it is broken.

When you move the check into the wait_event_interruptible condition then 
who is going to call wait_event_interruptible when the condition changes?

As I said this idea came up before and was rejected multiple times.

Regards,
Christian.

It's intended to mirror the hw_submission_limit logic. If you think this
is broken, then that's broken too. They are equivalent mechanisms.

This particular issue aside, fairness in global resource allocation is a
conversation I'd love to have! Right now the driver doesn't try to
ensure that, a queue can easily monopolize certain hardware resources
(though one queue can only monopolize one of each, so you'd need
something like 63 queues with 63 distinct VMs all submitting
free-running jobs back to back in order to starve other queues of
resources forever). For starters, one thing I'm thinking of doing is
reserving certain subsets of hardware resources for queues with a given
priority, so you can at least guarantee forward progress of
higher-priority queues when faced with misbehaving lower-priority
queues. But if we want to guarantee proper fairness, I think I'll have
to start doing things like switching to a CPU-roundtrip submission model
when resources become scarce (to guarantee that queues actually release
the resources once in a while) and then figure out how to add fairness
to the allocation code...

But let's have that conversation when we talk about the driver (or maybe
on IRC or something?), right now I'm more interested in getting the
abstractions reviewed ^^
Well that stuff is highly problematic as well. The fairness aside you
risk starvation which in turn breaks the guarantee of forward progress.

In this particular case you can catch this with a timeout for the hw
operation, but you should consider blocking that from the sw side as well.
In the current state I actually think it's not really that problematic,
because the resources are acquired directly in the ioctl path. So that
can block if starved, but if that can cause overall forward progress to
stop because some fence doesn't get signaled, then so can just not doing
the ioctl in the first place, so there's not much point (userspace can
always misbehave with its fence usage...). By the time anything gets
submitted to drm_sched, the resources are already guaranteed to be
acquired, we never block in the run callback.

It needs to be fixed of course, but if the threat model is a malicious
GPU process, well, there are many other ways to DoS your system... and I
don't think it's very likely that 63+ queues (which usually means 63+
processes with OpenGL) will end up accidentally starving the GPU in a
tight loop at the same time. I'd love to hear about real-world scenarios
where this kind of thing has been a real problem and not just a
theoretical one though... maybe I'm missing something?

Basically my priorities with the driver are:

1. Make sure it never crashes
2. Make sure it works well for real users
3. Make it work smoothly for real users under reasonable load
(priorities, CPU scheduler interactions, etc.)
4. Make it handle accidental problems more gracefully (OOMs etc, I need
to look into private GEM BO accounting to processes so the OOM killer
has better data to work with)
5. Make it more robust against deliberate abuse/starvation (this should
matter more once we have some kind of paravirtualization solution...)

And right now we're somewhere between 2 and 3. So if there are cases
where this resource acquisition stuff can cause a problem for real
users, I'll want to fix it earlier. But if this is more theoretical than
anything (with the resource limits of AGX GPUs), I'd rather focus on
things like memory accounting and shrinker support first.

We don't even have a shrinker yet, and I'm sure that's going to be a lot
of fun when we add it too... but yes, if we can't do any memory
allocations in some of these callbacks (is this documented anywhere?),
that's going to be interesting...
Yes, that is all part of the dma_fence documentation. It's just
absolutely not obvious what all this means.
I mean is there any documentation on how this interacts with drm_sched?
Like, am I not allowed to allocate memory in prepare()? What about
run()? What about GPU interrupt work? (not a raw IRQ handler context, I
mean the execution path from GPU IRQ to drm_sched run() fences getting
signaled)

It's not all bad news though! All memory allocations are fallible in
kernel Rust (and therefore explicit, and also failures have to be
explicitly handled or propagated), so it's pretty easy to point out
where they are, and there are already discussions of higher-level
tooling to enforce rules like that (and things like wait contexts).
Also, Rust makes it a lot easier to refactor code in general and not be
scared that you're going to regress everything, so I'm not really
worried if I need to turn a chunk of the driver on its head to solve
some of these problems in the future ^^ (I already did that when I
switched it from the "demo" synchronous submission model to the proper
explicit sync + fences one.)
Yeah, well the problem isn't that you run into memory allocation failure.
What I mean is that the mandatory failure handling means it's relatively
easy to audit where memory allocations can actually happen.

The problem is rather something like this:
1. You try to allocate memory to signal your fence.
2. This memory allocation can't be fulfilled and goes to sleep to wait
for reclaim.
3. On another CPU reclaim is running and through the general purpose
shrinker, page fault or MMU notifier ends up wait for your dma_fence.

You don't even need to implement the shrinker for this to go boom
extremely easy.
Hmm, can you actually get something waiting on a dma_fence like that
today with this driver? We don't have a shrinker, we don't have
synchronous page faults or MMU notifications for the GPU, and this is
explicit sync so all in/out fences cross over into userspace so surely
they can't be trusted anyway?

I'm definitely not familiar with the intricacies of DMA fences and how
they interact with everything else yet, but it's starting to sound like
either this isn't quite broken for our simple driver yet, or it must be
broken pretty much everywhere in some way...

So everything involved with signaling the fence can allocate memory only
with GFP_ATOMIC and only if you absolutely have to.
I don't think we even have a good story for passing around gfp_flags in
Rust code so that will be interesting... though I need to actually audit
the code paths and see how many allocations we really do. I know I alloc
some vectors for holding completed commands and stuff like that, but I'm
pretty sure I can fix that one with some reworking, and I'm not sure how
many other random things there really are...? Obviously most allocations
happen at command creation time, on completion you mostly get a lot of
freeing, so maybe I can just eliminate all allocs and not worry about
GFP_ATOMIC.

~~ Lina