Re: [RFC PATCH 00/10] drm/panthor: Add user submission

Christian König <christian.koenig@xxxxxxx> · Mon, 11 Nov 2024 13:57:15 +0100

Am 08.11.24 um 23:27 schrieb Matthew Brost:
On Tue, Sep 24, 2024 at 11:30:53AM +0200, Simona Vetter wrote:
Apologies for the late reply ...

Also late reply, just read this.

On Wed, Sep 04, 2024 at 01:34:18PM +0200, Christian König wrote:
Hi Boris,

Am 04.09.24 um 13:23 schrieb Boris Brezillon:
Please read up here on why that stuff isn't allowed:
https://www.kernel.org/doc/html/latest/driver-api/dma-buf.html#indefinite-dma-fences
panthor doesn't yet have a shrinker, so all memory is pinned, which means
memory management easy mode.
Ok, that at least makes things work for the moment.
Ah, perhaps this should have been spelt out more clearly ;)

The VM_BIND mechanism that's already in place jumps through some hoops
to ensure that memory is preallocated when the memory operations are
enqueued. So any memory required should have been allocated before any
sync object is returned. We're aware of the issue with memory
allocations on the signalling path and trying to ensure that we don't
have that.

I'm hoping that we don't need a shrinker which deals with (active) GPU
memory with our design.
That's actually what we were planning to do: the panthor shrinker was
about to rely on fences attached to GEM objects to know if it can
reclaim the memory. This design relies on each job attaching its fence
to the GEM mapped to the VM at the time the job is submitted, such that
memory that's in-use or about-to-be-used doesn't vanish before the GPU
is done.
Yeah and exactly that doesn't work any more when you are using user queues,
because the kernel has no opportunity to attach a fence for each submission.

Memory which user space thinks the GPU might
need should be pinned before the GPU work is submitted. APIs which
require any form of 'paging in' of data would need to be implemented by
the GPU work completing and being resubmitted by user space after the
memory changes (i.e. there could be a DMA fence pending on the GPU work).
Hard pinning memory could work (ioctl() around gem_pin/unpin()), but
that means we can't really transparently swap out GPU memory, or we
have to constantly pin/unpin around each job, which means even more
ioctl()s than we have now. Another option would be to add the XGS fence
to the BOs attached to the VM, assuming it's created before the job
submission itself, but you're no longer reducing the number of user <->
kernel round trips if you do that, because you now have to create an
XSG job for each submission, so you basically get back to one ioctl()
per submission.
For AMDGPU we are currently working on the following solution with memory
management and user queues:

1. User queues are created through an kernel IOCTL, submissions work by
writing into a ring buffer and ringing a doorbell.

2. Each queue can request the kernel to create fences for the currently
pushed work for a queues which can then be attached to BOs, syncobjs,
syncfiles etc...

3. Additional to that we have and eviction/preemption fence attached to all
BOs, page tables, whatever resources we need.

4. When this eviction fences are requested to signal they first wait for all
submission fences and then suspend the user queues and block creating new
submission fences until the queues are restarted again.
Yup this works, at least when I play it out in my head.

I just started experimenting with user submission in Xe last week and
ended up landing on a different PoC, blissfully unaware future fences /
Mesa submit thread. However, after Sima filled me in, I’ve essentially
landed on exactly what Christian is describing in Xe. I haven’t coded it
yet, but have the design in my head.

Sounds like going over that design again and again was good invested time.

And yeah we have it working and at least so far it really looks like it 
works.

I also generally agree with Sima’s comments about having a somewhat
generic preempt fence (Christian refers to this as an eviction fence)
as well.

Well that is really a bike-sheet.

I don't care if we call it preempt fence or eviction fence as long as 
everybody understands what that thing is supposed to do.

Probably something we should document.

Additionally, I’m thinking it might be beneficial for us to add a new
'preempt' dma-resv slot to track these, which would make it easier to
enforce the ordering of submission fence signaling before preempt
fences.

That's exactly what DMA_RESV_USAGE_BOOKKEEP is good for.

And yes, I spend really *a lot of time* planning this :)

Depending on bandwidth, I may post an RFC to the list soon. I’ll also
gauge the interest and bandwidth from our Mesa team to begin UMD work.

Please loop me in as well.

Regards,
Christian.

Matt

Note that the completion fence is only deadlock free if userspace is
really, really careful. Which in practice means you need the very
carefully constructed rules for e.g. vulkan winsys fences, otherwise you
do indeed deadlock.

But if you keep that promise in mind, then it works, and step 2 is
entirely option, which means we can start userspace in a pure long-running
compute mode where there's only eviction/preemption fences. And then if
userspace needs a vulkan winsys fence, we can create that with step 2 as
needed.

But the important part is that you need really strict rules on userspace
for when step 2 is ok and won't result in deadlocks. And those rules are
uapi, which is why I think doing this in panthor without the shrinker and
eviction fences (i.e. steps 3&4 above) is a very bad mistake.

This way you can still do your memory management inside the kernel (e.g.
move BOs from local to system memory) or even completely suspend and resume
applications without their interaction, but as Sima said it is just horrible
complicated to get right.

We have been working on this for like two years now and it still could be
that we missed something since it is not in production testing yet.
Ack.
-Sima
--
Simona Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch