Re: [RFC PATCH 00/10] drm/panthor: Add user submission

Christian König <christian.koenig@xxxxxxx> · Wed, 4 Sep 2024 15:36:38 +0200

Am 04.09.24 um 14:46 schrieb Steven Price:
On 04/09/2024 12:34, Christian König wrote:
Hi Boris,

Am 04.09.24 um 13:23 schrieb Boris Brezillon:
Please read up here on why that stuff isn't allowed:
https://www.kernel.org/doc/html/latest/driver-api/dma-buf.html#indefinite-dma-fences
panthor doesn't yet have a shrinker, so all memory is pinned, which means
memory management easy mode.
Ok, that at least makes things work for the moment.
Ah, perhaps this should have been spelt out more clearly ;)

The VM_BIND mechanism that's already in place jumps through some hoops
to ensure that memory is preallocated when the memory operations are
enqueued. So any memory required should have been allocated before any
sync object is returned. We're aware of the issue with memory
allocations on the signalling path and trying to ensure that we don't
have that.

I'm hoping that we don't need a shrinker which deals with (active) GPU
memory with our design.
That's actually what we were planning to do: the panthor shrinker was
about to rely on fences attached to GEM objects to know if it can
reclaim the memory. This design relies on each job attaching its fence
to the GEM mapped to the VM at the time the job is submitted, such that
memory that's in-use or about-to-be-used doesn't vanish before the GPU
is done.
How progressed is this shrinker? It would be good to have an RFC so that
we can look to see how user submission could fit in with it.

Yeah and exactly that doesn't work any more when you are using user
queues, because the kernel has no opportunity to attach a fence for each
submission.
User submission requires a cooperating user space[1]. So obviously user
space would need to ensure any BOs that it expects will be accessed to
be in some way pinned. Since the expectation of user space submission is
that we're reducing kernel involvement, I'd also expect these to be
fairly long-term pins.

[1] Obviously with a timer to kill things from a malicious user space.

The (closed) 'kbase' driver has a shrinker but is only used on a subset
of memory and it's up to user space to ensure that it keeps the relevant
parts pinned (or more specifically not marking them to be discarded if
there's memory pressure). Not that I think we should be taking it's
model as a reference here.

Memory which user space thinks the GPU might
need should be pinned before the GPU work is submitted. APIs which
require any form of 'paging in' of data would need to be implemented by
the GPU work completing and being resubmitted by user space after the
memory changes (i.e. there could be a DMA fence pending on the GPU work).
Hard pinning memory could work (ioctl() around gem_pin/unpin()), but
that means we can't really transparently swap out GPU memory, or we
have to constantly pin/unpin around each job, which means even more
ioctl()s than we have now. Another option would be to add the XGS fence
to the BOs attached to the VM, assuming it's created before the job
submission itself, but you're no longer reducing the number of user <->
kernel round trips if you do that, because you now have to create an
XSG job for each submission, so you basically get back to one ioctl()
per submission.
As you say the granularity of pinning has to be fairly coarse for user
space submission to make sense. My assumption (could be wildly wrong)
was that most memory would be pinned whenever a context is rendering.

For AMDGPU we are currently working on the following solution with
memory management and user queues:

1. User queues are created through an kernel IOCTL, submissions work by
writing into a ring buffer and ringing a doorbell.

2. Each queue can request the kernel to create fences for the currently
pushed work for a queues which can then be attached to BOs, syncobjs,
syncfiles etc...

3. Additional to that we have and eviction/preemption fence attached to
all BOs, page tables, whatever resources we need.

4. When this eviction fences are requested to signal they first wait for
all submission fences and then suspend the user queues and block
creating new submission fences until the queues are restarted again.

This way you can still do your memory management inside the kernel (e.g.
move BOs from local to system memory) or even completely suspend and
resume applications without their interaction, but as Sima said it is
just horrible complicated to get right.

We have been working on this for like two years now and it still could
be that we missed something since it is not in production testing yet.
I'm not entirely sure I follow how this doesn't create a dependency
cycle. From your description it sounds like you create a fence from the
user space queue which is then used to prevent eviction of the BOs needed.

So to me it sounds like:

1. Attach fence to BOs to prevent eviction.

2. User space submits work to the ring buffer, rings doorbell.

3. Call into the kernel to create the fence for step 1.

Which is obviously broken. What am I missing?

Those are two different fences, we have an eviction fence and a 
submission fence.

The eviction fence takes care of making sure the memory is not moved 
while submissions are running.

And the submission fence is the actual signal to completion.

One other thing to note is that Mali doesn't have local memory - so the
only benefit of unpinning is if we want to swap to disk (or zram etc).

Yeah that makes sense, but the problem remains that with an userspace 
submission module you completely nail that pinning.

Regards,
Christian.

Steve