Re: [Mesa-dev] [RFC] Linux Graphics Next: Explicit fences everywhere and no BO fences - initial proposal

Christian König <ckoenig.leichtzumerken@xxxxxxxxx> · Wed, 28 Apr 2021 15:37:49 +0200

Am 28.04.21 um 15:34 schrieb Daniel Vetter:
On Wed, Apr 28, 2021 at 03:11:27PM +0200, Christian König wrote:
Am 28.04.21 um 14:26 schrieb Daniel Vetter:
On Wed, Apr 28, 2021 at 02:21:54PM +0200, Daniel Vetter wrote:
On Wed, Apr 28, 2021 at 12:31:09PM +0200, Christian König wrote:
Am 28.04.21 um 12:05 schrieb Daniel Vetter:
On Tue, Apr 27, 2021 at 02:01:20PM -0400, Alex Deucher wrote:
On Tue, Apr 27, 2021 at 1:35 PM Simon Ser <contact@xxxxxxxxxxx> wrote:
On Tuesday, April 27th, 2021 at 7:31 PM, Lucas Stach <l.stach@xxxxxxxxxxxxxx> wrote:

Ok. So that would only make the following use cases broken for now:

- amd render -> external gpu
- amd video encode -> network device
FWIW, "only" breaking amd render -> external gpu will make us pretty
unhappy
I concur. I have quite a few users with a multi-GPU setup involving
AMD hardware.

Note, if this brokenness can't be avoided, I'd prefer a to get a clear
error, and not bad results on screen because nothing is synchronized
anymore.
It's an upcoming requirement for windows[1], so you are likely to
start seeing this across all GPU vendors that support windows.  I
think the timing depends on how quickly the legacy hardware support
sticks around for each vendor.
Yeah but hw scheduling doesn't mean the hw has to be constructed to not
support isolating the ringbuffer at all.

E.g. even if the hw loses the bit to put the ringbuffer outside of the
userspace gpu vm, if you have pagetables I'm seriously hoping you have r/o
pte flags. Otherwise the entire "share address space with cpu side,
seamlessly" thing is out of the window.

And with that r/o bit on the ringbuffer you can once more force submit
through kernel space, and all the legacy dma_fence based stuff keeps
working. And we don't have to invent some horrendous userspace fence based
implicit sync mechanism in the kernel, but can instead do this transition
properly with drm_syncobj timeline explicit sync and protocol reving.

At least I think you'd have to work extra hard to create a gpu which
cannot possibly be intercepted by the kernel, even when it's designed to
support userspace direct submit only.

Or are your hw engineers more creative here and we're screwed?
The upcomming hardware generation will have this hardware scheduler as a
must have, but there are certain ways we can still stick to the old
approach:

1. The new hardware scheduler currently still supports kernel queues which
essentially is the same as the old hardware ring buffer.

2. Mapping the top level ring buffer into the VM at least partially solves
the problem. This way you can't manipulate the ring buffer content, but the
location for the fence must still be writeable.
Yeah allowing userspace to lie about completion fences in this model is
ok. Though I haven't thought through full consequences of that, but I
think it's not any worse than userspace lying about which buffers/address
it uses in the current model - we rely on hw vm ptes to catch that stuff.

Also it might be good to switch to a non-recoverable ctx model for these.
That's already what we do in i915 (opt-in, but all current umd use that
mode). So any hang/watchdog just kills the entire ctx and you don't have
to worry about userspace doing something funny with it's ringbuffer.
Simplifies everything.

Also ofc userspace fencing still disallowed, but since userspace would
queu up all writes to its ringbuffer through the drm/scheduler, we'd
handle dependencies through that still. Not great, but workable.

Thinking about this, not even mapping the ringbuffer r/o is required, it's
just that we must queue things throug the kernel to resolve dependencies
and everything without breaking dma_fence. If userspace lies, tdr will
shoot it and the kernel stops running that context entirely.
Thinking more about that approach I don't think that it will work correctly.

See we not only need to write the fence as signal that an IB is submitted,
but also adjust a bunch of privileged hardware registers.

When userspace could do that from its IBs as well then there is nothing
blocking it from reprogramming the page table base address for example.

We could do those writes with the CPU as well, but that would be a huge
performance drop because of the additional latency.
That's not what I'm suggesting. I'm suggesting you have the queue and
everything in userspace, like in wondows. Fences are exactly handled like
on windows too. The difference is:

- All new additions to the ringbuffer are done through a kernel ioctl
   call, using the drm/scheduler to resolve dependencies.

- Memory management is also done like today int that ioctl.

- TDR makes sure that if userspace abuses the contract (which it can, but
   it can do that already today because there's also no command parser to
   e.g. stop gpu semaphores) the entire context is shot and terminally
   killed. Userspace has to then set up a new one. This isn't how amdgpu
   recovery works right now, but i915 supports it and I think it's also the
   better model for userspace error recovery anyway.

So from hw pov this will look _exactly_ like windows, except we never page
fault.

 From sw pov this will look _exactly_ like current kernel ringbuf model,
with exactly same dma_fence semantics. If userspace lies, does something
stupid or otherwise breaks the uapi contract, vm ptes stop invalid access
and tdr kills it if it takes too long.

Where do you need priviledge IB writes or anything like that?

For writing the fence value and setting up the priority and VM registers.

Christian.

Ofc kernel needs to have some safety checks in the dma_fence timeline that
relies on userspace ringbuffer to never go backwards or unsignal, but
that's kinda just more compat cruft to make the kernel/dma_fence path
work.

Cheers, Daniel

_______________________________________________
dri-devel mailing list
dri-devel@xxxxxxxxxxxxxxxxxxxxx
https://lists.freedesktop.org/mailman/listinfo/dri-devel