Re: [Mesa-dev] Linux Graphics Next: Userspace submission update

Christian König <ckoenig.leichtzumerken@xxxxxxxxx> · Fri, 4 Jun 2021 09:00:31 +0200

On 02.06.21 at 21:19, Daniel Vetter wrote:
On Wed, Jun 02, 2021 at 08:52:38PM +0200, Christian König wrote:

On 02.06.21 at 20:48, Daniel Vetter wrote:
On Wed, Jun 02, 2021 at 05:38:51AM -0400, Marek Olšák wrote:
On Wed, Jun 2, 2021 at 5:34 AM Marek Olšák <maraeo@xxxxxxxxx> wrote:

Yes, we can't break anything because we don't want to complicate things
for us. It's pretty much all NAK'd already. We are trying to gather more
knowledge and then make better decisions.

The idea we are considering is that we'll expose memory-based sync objects
to userspace for read only, and the kernel or hw will strictly control the
memory writes to those sync objects. The hole in that idea is that
userspace can decide not to signal a job, so even if userspace can't
overwrite memory-based sync object states arbitrarily, it can still decide
not to signal them, and then a future fence is born.

This would actually be treated as a GPU hang caused by that context, so it
should be fine.
This is practically what I proposed already, except you're not doing it with
dma_fence. And on the memory fence side this also doesn't actually give
what you want for that compute model.

This seems like a bit of a worst-of-both-worlds approach to me? Tons of work
in the kernel to hide these not-dma_fence-but-almost, and still pain to
actually drive the hardware like it should be for compute or direct
display.

Also maybe I've missed it, but I didn't see any replies to my suggestion
how to fake the entire dma_fence stuff on top of new hw. Would be
interesting to know what doesn't work there instead of amd folks going off
into internal again and then coming back with another rfc from out of
nowhere :-)
Well to be honest I would just push back on our hardware/firmware guys that
we need to keep kernel queues forever before going down that route.
I looked again, and you said the model won't work because preemption is way
too slow, even when the context is idle.

I guess at that point I got maybe too fed up and just figured "not my
problem", but if preempt is too slow as the unload fence, you can do it
with pte removal and tlb shootdown too (that is hopefully not too slow,
otherwise your hw is just garbage and won't even be fast for direct submit
compute workloads).

Have you seen that one here: 
https://www.spinics.net/lists/amd-gfx/msg63101.html :)

I've rejected it because I think polling for 6 seconds on a TLB flush 
which can block interrupts as well is just madness.

The only thing that you need to do when you use pte clearing + tlb
shootdown instead of preemption as the unload fence for buffers that get
moved is that if you get any gpu page fault, you don't serve that, but
instead treat it as a tdr and shoot the context permanently.
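
A minimal sketch of that rule, as a toy Python model rather than kernel code. The names Context, unload_buffer and gpu_access are hypothetical and only illustrate the policy "never service a fault, ban the context instead":

```python
from dataclasses import dataclass, field


@dataclass
class Context:
    """Hypothetical model of one GPU context and its mapped pages."""
    name: str
    mapped_pages: set = field(default_factory=set)
    banned: bool = False


def unload_buffer(ctx, pages):
    """Model pte clearing + tlb shootdown: the pages are simply unmapped."""
    ctx.mapped_pages -= set(pages)


def gpu_access(ctx, page):
    """Return True if the access succeeds.

    Any fault is treated like a tdr: it is never repaired, and the
    context is shot permanently.
    """
    if ctx.banned:
        return False
    if page not in ctx.mapped_pages:
        ctx.banned = True
        return False
    return True
```

In this model a context that touches a buffer after it was moved loses all further access, including to pages that are still mapped.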

So summarizing the model I proposed:

- you allow userspace to directly write into the ringbuffer, and also
   write the fences directly

- actual submit is done by the kernel, using drm/scheduler. The kernel
   blindly trusts userspace to set up everything else, and even just wraps
   dma_fences around the userspace memory fences.

- the only check is tdr. If a fence doesn't complete and tdr fires, a) the
   kernel shoots the entire context and b) userspace recovers by setting up a
   new ringbuffer

- memory management is done using ttm only, you still need to supply the
   buffer list (ofc that list includes the always present ones, so CS will
   only get the list of special buffers like today). If your hw can't trun
   gpu page faults and you ever get one we pull up the same old solution:
   Kernel shoots the entire context.

   The important thing is that from the gpu pov memory management works
   exactly like compute workload with direct submit, except that you just
   terminate the context on _any_ page fault, instead of only those that go
   somewhere where there's really no mapping and repair the others.

   Also I guess from reading the old thread this means you'd disable page
   fault retry because that is apparently also way too slow for anything.

- memory management uses an unload fence. That unload fence waits for all
   userspace memory fences (represented as dma_fence) to complete, with
   maybe some fudge to busy-spin until we've reached the actual end of the
   ringbuffer (maybe you have an IB tail there after the memory fence write,
   we have that on intel hw), and it waits for the memory to get
   "unloaded". This is either preemption, or pte clearing + tlb shootdown,
   or whatever else your hw provides which is a) used for dynamic memory
   management b) fast enough for actual memory management.

- any time a context dies we force-complete all its pending fences,
   in-order ofc
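
The fence handling in the list above can be sketched as a toy Python model. This is an illustration under stated assumptions, not the drm/scheduler or dma_fence API: MemoryFence, WrappedFence, Context, tick and kill are hypothetical names, and time is a plain integer.

```python
class MemoryFence:
    """Stand-in for a memory location that userspace writes seqnos into."""
    def __init__(self):
        self.seqno = 0


class WrappedFence:
    """Kernel-side wrapper around one expected value of a memory fence."""
    def __init__(self, mem, target, deadline):
        self.mem = mem
        self.target = target
        self.deadline = deadline
        self.forced = False

    def signaled(self):
        return self.forced or self.mem.seqno >= self.target


class Context:
    def __init__(self, timeout):
        self.mem = MemoryFence()
        self.timeout = timeout
        self.pending = []
        self.completion_order = []
        self.next_seqno = 1
        self.dead = False

    def submit(self, now):
        """Wrap the next memory fence value; nothing else is checked."""
        if self.dead:
            raise RuntimeError("context is dead, set up a new one")
        fence = WrappedFence(self.mem, self.next_seqno, now + self.timeout)
        self.next_seqno += 1
        self.pending.append(fence)
        return fence

    def tick(self, now):
        """Retire signaled fences in order, then run the tdr check."""
        while self.pending and self.pending[0].signaled():
            self.completion_order.append(self.pending.pop(0).target)
        if self.pending and now >= self.pending[0].deadline:
            self.kill()

    def kill(self):
        """Shoot the context and force-complete pending fences in order."""
        self.dead = True
        for fence in self.pending:
            fence.forced = True
            self.completion_order.append(fence.target)
        self.pending.clear()
```

The point of the model is that every wrapped fence completes in bounded time: either userspace writes the expected value, or the timeout fires and the whole context is torn down.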

So from hw pov this looks 99% like direct userspace submit, with the exact
same mappings, command sequences and everything else. The only difference
is that the ringbuffer head/tail updates happen from drm/scheduler, instead
of directly from userspace.
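
A toy Python sketch of that split, assuming a simple ring where userspace fills the slots but only the kernel side publishes the tail. UserRing and its method names are hypothetical and do not correspond to any real driver interface:

```python
class UserRing:
    """Userspace writes commands; the scheduler alone moves the hw tail."""
    def __init__(self, size):
        self.slots = [None] * size
        self.user_wptr = 0  # how far userspace has written
        self.hw_tail = 0    # how far the hardware may execute
        self.hw_head = 0    # how far the hardware has executed

    def user_write(self, cmd):
        """Userspace side: write a command directly into the ring."""
        if self.user_wptr - self.hw_head >= len(self.slots):
            raise BufferError("ring full")
        self.slots[self.user_wptr % len(self.slots)] = cmd
        self.user_wptr += 1

    def scheduler_submit(self):
        """Kernel side: publish what was written, without inspecting it."""
        self.hw_tail = self.user_wptr

    def hw_run(self):
        """Hardware side: execute everything up to the published tail."""
        executed = []
        while self.hw_head < self.hw_tail:
            executed.append(self.slots[self.hw_head % len(self.slots)])
            self.hw_head += 1
        return executed
```

Commands written by userspace stay invisible to the hardware until scheduler_submit runs, which is the one place where this model differs from direct userspace submission.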

None of this stuff needs funny tricks where the kernel controls the
writes to memory fences, or where you need kernel ringbuffers, or anything
like that. Userspace is allowed to do anything stupid, the rules are
guaranteed with:

- we rely on the hw isolation features to work, but _exactly_ like compute
   direct submit would too

- dying on any page fault captures memory management issues

- dying (without kernel recovery, this is up to userspace if it cares) on
   any tdr makes sure fences complete still

That syncfile and all that Android stuff isn't working out of the box with
the new shiny user queue submission model (which in turn is mostly because
of Windows) already raised some eyebrows here.
I think if you really want to make sure the current linux stack doesn't
break the _only_ option you have is to provide a ctx mode that allows
dma_fence and drm/scheduler to be used like today.

Yeah, but I still can just tell our hw/fw guys that we really really 
need to keep kernel queues or the whole Linux/Android infrastructure 
needs to get a compatibility layer like you describe above.

For everything else it sounds like you're a few years too late, because even
just huge kernel changes won't happen in time. Much less rewriting
userspace protocols.

Seconded, question is rather if we are going to start migrating at some 
point or if we should keep pushing on our hw/fw guys.

Christian.

-Daniel