Re: [Mesa-dev] [RFC] Linux Graphics Next: Explicit fences everywhere and no BO fences - initial proposal

Christian König <ckoenig.leichtzumerken@xxxxxxxxx> · Tue, 20 Apr 2021 13:59:44 +0200

Yeah. If we go with userspace fences, then userspace can hang itself. Not
the kernel's problem.

Well, the path of inner peace begins with four words. “Not my fucking 
problem.”

But I'm less concerned about the kernel than about
important userspace processes like X, Wayland, SurfaceFlinger etc...

I mean, attaching a page to a sync object and allowing it to be waited on
and signalled from both the CPU and the GPU side is not so much of a problem.

You have to somehow handle that, e.g. perhaps with conditional
rendering and just using the old frame in compositing if the new one
doesn't show up in time.

Nice idea, but how would you handle that on the OpenGL/Glamor/Vulkan level?

Regards,
Christian.

On 20.04.21 at 13:16, Daniel Vetter wrote:
On Tue, Apr 20, 2021 at 07:03:19AM -0400, Marek Olšák wrote:
Daniel, are you suggesting that we should skip any deadlock prevention in
the kernel, and just let userspace wait for and signal any fence it has
access to?
Yeah. If we go with userspace fences, then userspace can hang itself. Not
the kernel's problem. The only criterion is that the kernel itself must
never rely on these userspace fences, except for stuff like implementing
optimized cpu waits. And in those we must always guarantee that the
userspace process remains interruptible.

It's a completely different world from dma_fence based kernel fences,
whether those are implicit or explicit.

Do you have any concern with the deprecation/removal of BO fences in the
kernel assuming userspace is only using explicit fences? Any concern with
the submit and return fences for modesetting and other producer<->consumer
scenarios?
Let me work on the full reply to your RFC first, because there's a lot
of details here and nuance.
-Daniel

Thanks,
Marek

On Tue, Apr 20, 2021 at 6:34 AM Daniel Vetter <daniel@xxxxxxxx> wrote:

On Tue, Apr 20, 2021 at 12:15 PM Christian König
<ckoenig.leichtzumerken@xxxxxxxxx> wrote:
On 19.04.21 at 17:48, Jason Ekstrand wrote:
Not going to comment on everything on the first pass...

On Mon, Apr 19, 2021 at 5:48 AM Marek Olšák <maraeo@xxxxxxxxx> wrote:
Hi,

This is our initial proposal for explicit fences everywhere and new
memory management that doesn't use BO fences. It's a redesign of how Linux
graphics drivers work, and it can coexist with what we have now.

1. Introduction
(skip this if you are already sold on explicit fences)

The current Linux graphics architecture was initially designed for
GPUs with only one graphics queue where everything was executed in the
submission order and per-BO fences were used for memory management and
CPU-GPU synchronization, not GPU-GPU synchronization. Later, multiple
queues were added on top, which required the introduction of implicit
GPU-GPU synchronization between queues of different processes using per-BO
fences. Recently, even parallel execution within one queue was enabled
where a command buffer starts draws and compute shaders, but doesn't wait
for them, enabling parallelism between back-to-back command buffers.
Modesetting also uses per-BO fences for scheduling flips. Our GPU scheduler
was created to enable all those use cases, and it's the only reason why the
scheduler exists.
The GPU scheduler, implicit synchronization, BO-fence-based memory
management, and the tracking of per-BO fences increase CPU overhead and
latency, and reduce parallelism. There is a desire to replace all of them
with something much simpler. Below is how we could do it.

2. Explicit synchronization for window systems and modesetting

The producer is an application and the consumer is a compositor or a
modesetting driver.
2.1. The Present request

As part of the Present request, the producer will pass 2 fences (sync
objects) to the consumer alongside the presented DMABUF BO:
- The submit fence: Initially unsignalled, it will be signalled when
the producer has finished drawing into the presented buffer.
- The return fence: Initially unsignalled, it will be signalled when
the consumer has finished using the presented buffer.
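
A minimal sketch of the two-fence Present request described above, as a plain
Python model rather than any real kernel or window-system API. The names
Fence, PresentRequest, consumer_can_read and producer_can_reuse are
hypothetical and exist only for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Fence:
    """Hypothetical one-shot fence; it starts unsignalled."""
    signalled: bool = False

    def signal(self) -> None:
        self.signalled = True


@dataclass
class PresentRequest:
    """Hypothetical model of a Present request carrying two fences."""
    dmabuf: str
    # Signalled by the producer once drawing into the buffer has finished.
    submit_fence: Fence = field(default_factory=Fence)
    # Signalled by the consumer once it no longer uses the buffer.
    return_fence: Fence = field(default_factory=Fence)


def consumer_can_read(req: PresentRequest) -> bool:
    # The consumer may sample the buffer only after the submit fence.
    return req.submit_fence.signalled


def producer_can_reuse(req: PresentRequest) -> bool:
    # The producer may draw into the buffer again only after the return fence.
    return req.return_fence.signalled
```

Both fences start unsignalled, so neither side may touch the buffer until the
other side has signalled its fence.
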
I'm not sure syncobj is what we want.  In the Intel world we're trying
to go even further to something we're calling "userspace fences" which
are a timeline implemented as a single 64-bit value in some
CPU-mappable BO.  The client writes a higher value into the BO to
signal the timeline.
Well, that is exactly what our Windows guys have suggested as well, but
it strongly looks like this isn't sufficient.

First of all you run into security problems when any application can
just write any value to that memory location. Just imagine an
application sets the counter to zero and X waits forever for some
rendering to finish.
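
To make the concern concrete, here is a rough Python model of such a
userspace timeline: a 64-bit counter in shared memory, where a wait point
counts as reached once the counter is at least that value. The class
UserspaceTimeline and its methods are hypothetical; a bytearray stands in for
the CPU-mappable BO, and real hardware waits are not modelled.

```python
import struct


class UserspaceTimeline:
    """Hypothetical model: one 64-bit counter in CPU-mappable memory."""

    def __init__(self) -> None:
        self.mem = bytearray(8)  # stands in for the mapped BO

    def value(self) -> int:
        return struct.unpack_from("<Q", self.mem)[0]

    def write(self, v: int) -> None:
        # Nothing here stops a client from writing a lower value.
        struct.pack_into("<Q", self.mem, 0, v)

    def reached(self, point: int) -> bool:
        # A wait on `point` completes once the counter is >= point.
        return self.value() >= point
```

Writing a higher value signals the timeline, but because any client with the
mapping can also write a lower one, a waiter on a later point can be left
waiting indefinitely.
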
The thing is, with userspace fences, preventing security boundary issues
moves into userspace entirely. And it really doesn't matter whether
the event you're waiting on doesn't complete because the other app
crashed or was stupid or intentionally gave you a wrong fence point:
You have to somehow handle that, e.g. perhaps with conditional
rendering and just using the old frame in compositing if the new one
doesn't show up in time. Or something like that. So trying to get the
kernel involved but also not so much involved sounds like a bad design
to me.
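
One way to picture the "use the old frame" fallback on the CPU side is a
bounded wait, sketched below in Python. This is only an illustration under
assumptions: pick_frame is a hypothetical helper, a threading.Event stands in
for the client's fence, and GPU-side conditional rendering is not modelled.

```python
import threading


def pick_frame(new_frame_ready: threading.Event, new_frame, old_frame,
               timeout_s: float):
    """Return the new frame if its fence signals in time, else the old one."""
    # Event.wait returns True if the event was set before the timeout expired.
    if new_frame_ready.wait(timeout_s):
        return new_frame
    return old_frame
```

The compositor never blocks longer than timeout_s, regardless of whether the
client crashed, misbehaved or handed over a bogus fence point.
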

In addition to that, in such a model you can't determine who is the guilty
queue in case of a hang and can't reset the synchronization primitives
in case of an error.

Apart from that this is rather inefficient, e.g. we don't have any way
to prevent priority inversion when used as a synchronization mechanism
between different GPU queues.
Yeah but you can't have it both ways. Either all the scheduling in the
kernel and fence handling is a problem, or you actually want to
schedule in the kernel. hw seems to definitely move towards the more
stupid spinlock-in-hw model (and direct submit from userspace and all
that), priority inversions be damned. I'm really not sure we should
fight that - if it's really that inefficient then maybe hw will add
support for waiting sync constructs in hardware, or at least be
smarter about scheduling other stuff. E.g. on intel hw both the kernel
scheduler and fw scheduler know when you're spinning on a hw fence
(whether userspace or kernel doesn't matter) and plugs in something
else. Add in a bit of hw support to watch cachelines, and you have
something which can handle both directions efficiently.

Imo given where hw is going, we shouldn't try to be too clever here.
The only thing we do need to provision is being able to do cpu side
waits without spinning. And that should probably be done in a fairly
gpu specific way still.
-Daniel

Christian.

    The kernel then provides some helpers for
waiting on them reliably and without spinning.  I don't expect
everyone to support these right away, but if we're going to re-plumb
userspace for explicit synchronization, I'd like to make sure we take
this into account so we only have to do it once.

Deadlock mitigation to recover from segfaults:
- The kernel knows which process is obliged to signal which fence.
This information is part of the Present request and supplied by userspace.
This isn't clear to me.  Yes, if we're using anything dma-fence based
like syncobj, this is true.  But it doesn't seem totally true as a
general statement.

- If the producer crashes, the kernel signals the submit fence, so
that the consumer can make forward progress.
- If the consumer crashes, the kernel signals the return fence, so
that the producer can reclaim the buffer.
- A GPU hang signals all fences. Other deadlocks will be handled like
GPU hangs.
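
The crash cases in the list above could be modelled roughly as follows. This
is a sketch, not a proposed kernel interface: FenceRegistry and its methods
are hypothetical, and the GPU-hang case is deliberately left out because the
proposal does not spell out which fences it covers.

```python
class FenceRegistry:
    """Hypothetical kernel-side table: fence name -> pid obliged to signal."""

    def __init__(self) -> None:
        self.owner = {}
        self.signalled = set()

    def register(self, fence: str, pid: int) -> None:
        # Ownership is supplied by userspace as part of the Present request.
        self.owner[fence] = pid

    def signal(self, fence: str) -> None:
        self.signalled.add(fence)

    def process_exited(self, pid: int) -> None:
        # Signal every fence the dead process was obliged to signal,
        # so the other side can make forward progress.
        for fence, owner in self.owner.items():
            if owner == pid:
                self.signalled.add(fence)
```
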
What do you mean by "all"?  All fences that were supposed to be
signaled by the hung context?

Other window system requests can follow the same idea.

Merged fences where one fence object contains multiple fences will be
supported. A merged fence is signalled only when all of its fences are signalled.
The consumer will have the option to redefine the unsignalled return fence
to a merged fence.
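
As an illustration of the merged-fence rule, a minimal Python sketch could
look like this. SimpleFence and MergedFence are hypothetical names, and the
sketch assumes a merged fence is signalled only once every member fence is.

```python
class SimpleFence:
    """Hypothetical one-shot fence; it starts unsignalled."""

    def __init__(self) -> None:
        self._signalled = False

    def signal(self) -> None:
        self._signalled = True

    def is_signalled(self) -> bool:
        return self._signalled


class MergedFence:
    """Hypothetical fence object that contains multiple fences."""

    def __init__(self, fences) -> None:
        self._fences = list(fences)

    def is_signalled(self) -> bool:
        # Signalled only when every contained fence is signalled.
        return all(f.is_signalled() for f in self._fences)
```

Since MergedFence exposes the same is_signalled() method, a consumer could
hand one back in place of the original return fence, for example when several
of its own jobs still read the buffer.
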
2.2. Modesetting

Since a modesetting driver can also be the consumer, the present
ioctl will contain a submit fence and a return fence too. One small problem
with this is that userspace can hang the modesetting driver, but in theory,
any later present ioctl can override the previous one, so the unsignalled
presentation is never used.

3. New memory management

The per-BO fences will be removed and the kernel will not know which
buffers are busy. This will reduce CPU overhead and latency. The kernel
will not need per-BO fences with explicit synchronization, so we just need
to remove their last user: buffer evictions. It also resolves the current
OOM deadlock.
Is this even really possible?  I'm no kernel MM expert (trying to
learn some) but my understanding is that the use of per-BO dma-fence
runs deep.  I would like to stop using it for implicit synchronization
to be sure, but I'm not sure I believe the claim that we can get rid
of it entirely.  Happy to see someone try, though.

3.1. Evictions

If the kernel wants to move a buffer, it will have to wait for
everything to go idle, halt all userspace command submissions, move the
buffer, and resume everything. This is not expected to happen when memory
is not exhausted. Other more efficient ways of synchronization are also
possible (e.g. sync only one process), but are not discussed here.
3.2. Per-process VRAM usage quota

Each process can optionally and periodically query its VRAM usage
quota and change domains of its buffers to obey that quota. For example, a
process allocated 2 GB of buffers in VRAM, but the kernel decreased the
quota to 1 GB. The process can change the domains of the least important
buffers to GTT to get the best outcome for itself. If the process doesn't
do it, the kernel will choose which buffers to evict at random. (thanks to
Christian Koenig for this idea)
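
A process-side policy for obeying a reduced quota might look like the sketch
below. It is only one possible heuristic: buffers_to_demote is a hypothetical
helper, the per-buffer priority is an assumed application-defined number, and
no real query or domain-change ioctl is shown.

```python
def buffers_to_demote(buffers, quota_bytes):
    """Pick buffers to move from VRAM to GTT so VRAM usage fits the quota.

    buffers: list of (name, size_bytes, priority) tuples currently in VRAM;
    a lower priority means the buffer is less important.
    Returns the names to demote, least important first.
    """
    usage = sum(size for _, size, _ in buffers)
    demote = []
    for name, size, _ in sorted(buffers, key=lambda b: b[2]):
        if usage <= quota_bytes:
            break
        demote.append(name)
        usage -= size
    return demote
```

With 2 GB allocated and the quota cut to 1 GB, demoting a single
low-priority 1 GB buffer is enough, and the more important buffers stay in
VRAM.
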
This is going to be difficult.  On Intel, we have some resources that
have to be pinned to VRAM and can't be dynamically swapped out by the
kernel.  In GL, we probably can deal with it somewhat dynamically.  In
Vulkan, we'll be entirely dependent on the application to use the
appropriate Vulkan memory budget APIs.

--Jason

3.3. Buffer destruction without per-BO fences

When the buffer destroy ioctl is called, an optional fence list can
be passed to the kernel to indicate when it's safe to deallocate the
buffer. If the fence list is empty, the buffer will be deallocated
immediately. Shared buffers will be handled by merging fence lists from all
processes that destroy them. Mitigation of malicious behavior:
- If userspace destroys a busy buffer, it will get a GPU page fault.
- If userspace sends fences that never signal, the kernel will have a
timeout period and then will proceed to deallocate the buffer anyway.
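
The destroy rule and its timeout mitigation can be written as a single
predicate, sketched here in Python. can_free is a hypothetical helper and the
timeout length is an assumption, since the proposal does not give a value.

```python
def can_free(fence_states, elapsed_s: float, timeout_s: float) -> bool:
    """Decide whether a destroyed buffer may be deallocated now.

    fence_states: list of bools, True meaning that fence has signalled.
    elapsed_s: time since the destroy ioctl was called.
    timeout_s: assumed kernel timeout for fences that never signal.
    """
    # all([]) is True, so an empty fence list frees the buffer immediately.
    return all(fence_states) or elapsed_s >= timeout_s
```

For a shared buffer, fence_states would be the merged list from all processes
that destroyed it.
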
3.4. Other notes on MM

Overcommitment of GPU-accessible memory will cause an allocation
failure or invoke the OOM killer. Evictions to GPU-inaccessible memory
might not be supported.
Kernel drivers could move to this new memory management today. Only
buffer residency and evictions would stop using per-BO fences.

4. Deprecating implicit synchronization

It can be phased out by introducing a new generation of hardware
where the driver doesn't add support for it (like a driver fork would do),
assuming userspace has all the changes for explicit synchronization. This
could potentially create an isolated part of the kernel DRM where all
drivers only support explicit synchronization.
Marek
_______________________________________________
dri-devel mailing list
dri-devel@xxxxxxxxxxxxxxxxxxxxx
https://lists.freedesktop.org/mailman/listinfo/dri-devel
_______________________________________________
mesa-dev mailing list
mesa-dev@xxxxxxxxxxxxxxxxxxxxx
https://lists.freedesktop.org/mailman/listinfo/mesa-dev

--
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch
