Re: [Mesa-dev] Linux Graphics Next: Userspace submission update

Daniel Vetter <daniel@xxxxxxxx> · Wed, 2 Jun 2021 21:19:07 +0200

On Wed, Jun 02, 2021 at 08:52:38PM +0200, Christian König wrote:
> 
> 
> Am 02.06.21 um 20:48 schrieb Daniel Vetter:
> > On Wed, Jun 02, 2021 at 05:38:51AM -0400, Marek Olšák wrote:
> > > On Wed, Jun 2, 2021 at 5:34 AM Marek Olšák <maraeo@xxxxxxxxx> wrote:
> > > 
> > > > Yes, we can't break anything because we don't want to complicate things
> > > > for us. It's pretty much all NAK'd already. We are trying to gather more
> > > > knowledge and then make better decisions.
> > > > 
> > > > The idea we are considering is that we'll expose memory-based sync objects
> > > > to userspace for read only, and the kernel or hw will strictly control the
> > > > memory writes to those sync objects. The hole in that idea is that
> > > > userspace can decide not to signal a job, so even if userspace can't
> > > > overwrite memory-based sync object states arbitrarily, it can still decide
> > > > not to signal them, and then a future fence is born.
> > > > 
> > > This would actually be treated as a GPU hang caused by that context, so it
> > > should be fine.
> > This is practically what I proposed already, except your not doing it with
> > dma_fence. And on the memory fence side this also doesn't actually give
> > what you want for that compute model.
> > 
> > This seems like a bit a worst of both worlds approach to me? Tons of work
> > in the kernel to hide these not-dma_fence-but-almost, and still pain to
> > actually drive the hardware like it should be for compute or direct
> > display.
> > 
> > Also maybe I've missed it, but I didn't see any replies to my suggestion
> > how to fake the entire dma_fence stuff on top of new hw. Would be
> > interesting to know what doesn't work there instead of amd folks going of
> > into internal again and then coming back with another rfc from out of
> > nowhere :-)
> 
> Well to be honest I would just push back on our hardware/firmware guys that
> we need to keep kernel queues forever before going down that route.

I looked again, and you said the model wont work because preemption is way
too slow, even when the context is idle.

I guess at that point I got maybe too fed up and just figured "not my
problem", but if preempt is too slow as the unload fence, you can do it
with pte removal and tlb shootdown too (that is hopefully not too slow,
otherwise your hw is just garbage and wont even be fast for direct submit
compute workloads).

The only thing that you need to do when you use pte clearing + tlb
shootdown instad of preemption as the unload fence for buffers that get
moved is that if you get any gpu page fault, you don't serve that, but
instead treat it as a tdr and shot the context permanently.

So summarizing the model I proposed:

- you allow userspace to directly write into the ringbuffer, and also
  write the fences directly

- actual submit is done by the kernel, using drm/scheduler. The kernel
  blindly trusts userspace to set up everything else, and even just wraps
  dma_fences around the userspace memory fences.

- the only check is tdr. If a fence doesn't complete an tdr fires, a) the
  kernel shot the entire context and b) userspace recovers by setting up a
  new ringbuffer

- memory management is done using ttm only, you still need to supply the
  buffer list (ofc that list includes the always present ones, so CS will
  only get the list of special buffers like today). If you hw can't trun
  gpu page faults and you ever get one we pull up the same old solution:
  Kernel shots the entire context.

  The important thing is that from the gpu pov memory management works
  exactly like compute workload with direct submit, except that you just
  terminate the context on _any_ page fault, instead of only those that go
  somewhere where there's really no mapping and repair the others.

  Also I guess from reading the old thread this means you'd disable page
  fault retry because that is apparently also way too slow for anything.

- memory management uses an unload fence. That unload fences waits for all
  userspace memory fences (represented as dma_fence) to complete, with
  maybe some fudge to busy-spin until we've reached the actual end of the
  ringbuffer (maybe you have a IB tail there after the memory fence write,
  we have that on intel hw), and it waits for the memory to get
  "unloaded". This is either preemption, or pte clearing + tlb shootdown,
  or whatever else your hw provides which is a) used for dynamic memory
  management b) fast enough for actual memory management.

- any time a context dies we force-complete all it's pending fences,
  in-order ofc

So from hw pov this looks 99% like direct userspace submit, with the exact
same mappings, command sequences and everything else. The only difference
is that the rinbuffer head/tail updates happen from drm/scheduler, instead
of directly from userspace.

None of this stuff needs funny tricks where the kernel controls the
writes to memory fences, or where you need kernel ringbuffers, or anything
like thist. Userspace is allowed to do anything stupid, the rules are
guaranteed with:

- we rely on the hw isolation features to work, but _exactly_ like compute
  direct submit would too

- dying on any page fault captures memory management issues

- dying (without kernel recover, this is up to userspace if it cares) on
  any tdr makes sure fences complete still

> That syncfile and all that Android stuff isn't working out of the box with
> the new shiny user queue submission model (which in turn is mostly because
> of Windows) already raised some eyebrows here.

I think if you really want to make sure the current linux stack doesn't
break the _only_ option you have is provide a ctx mode that allows
dma_fence and drm/scheduler to be used like today.

For everything else it sounds you're a few years too late, because even
just huge kernel changes wont happen in time. Much less rewriting
userspace protocols.
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch