Re: [RFC] Linux Graphics Next: Explicit fences everywhere and no BO fences - initial proposal

Daniel Stone <daniel@xxxxxxxxxxxxx> · Tue, 20 Apr 2021 13:42:26 +0100

Hi Marek,

On Mon, 19 Apr 2021 at 11:48, Marek Olšák <maraeo@xxxxxxxxx> wrote:
2. Explicit synchronization for window systems and modesetting

The producer is an application and the consumer is a compositor or a modesetting driver.

2.1. The Present request

So the 'present request' is an ioctl, right? Not a userspace construct like it is today? If so, how do we correlate the two?

The terminology is pretty X11-centric so I'll assume that's what you've designed against, but Wayland and even X11 carry much more auxiliary information attached to a present request than just 'this buffer, this swapchain'. Wayland latches a lot of data on presentation, including non-graphics data such as surface geometry (so we can have resizes which don't suck), window state (e.g. fullscreen or not, also so we can have resizes which don't suck), and these requests can also cascade through a tree of subsurfaces (so we can have embeds which don't suck). X11 mostly just carries timestamps, which is more tractable.

Given we don't want to move the entirety of Wayland into kernel-visible objects, how do we synchronise the two streams so they aren't incoherent? Taking a rough stab at it whilst assuming we do have DRM_IOCTL_NONMODE_PRESENT, this would create a present object somewhere in kernel space, which the producer would create and ?? export a FD from, that the compositor would ?? import.

As part of the Present request, the producer will pass 2 fences (sync objects) to the consumer alongside the presented DMABUF BO:
- The submit fence: Initially unsignalled, it will be signalled when the producer has finished drawing into the presented buffer.

We already have this in Wayland through dma_fence. I'm relaxed about this becoming drm_syncobj or drm_newmappedysncobjthing, it's just a matter of typing. X11 has patches to DRI3 to support dma_fence, but they never got merged because they were far too invasive to a server which is no longer maintained.

- The return fence: Initially unsignalled, it will be signalled when the consumer has finished using the presented buffer.

Currently in Wayland the return fence (again a dma_fence) is generated by the compositor and sent as an event when it's done, because we can't have speculative/empty/future fences. drm_syncobj would make this possible, but so far I've been hesitant because I don't see the benefit to it (more below).

Deadlock mitigation to recover from segfaults:
- The kernel knows which process is obliged to signal which fence. This information is part of the Present request and supplied by userspace.

Same as today with dma_fence. Less true with drm_syncobj if we're using timelines.

- If the producer crashes, the kernel signals the submit fence, so that the consumer can make forward progress.

This is only a change if the producer is now allowed to submit a fence before it's flushed the work which would eventually fulfill that fence. Using dma_fence has so far isolated us from this.

- If the consumer crashes, the kernel signals the return fence, so that the producer can reclaim the buffer.

'The consumer' is problematic, per below. I think the wording you want is 'if no references are held to the submitted present object'.

- A GPU hang signals all fences. Other deadlocks will be handled like GPU hangs.

Other window system requests can follow the same idea.

Which other window system requests did you have in mind? Again, moving the entirety of Wayland's signaling into the kernel is a total non-starter. Partly because it means our entire protocol would be subject to the kernel's ABI rules, partly because the rules and interdependencies between the requests are extremely complex, but mostly because the kernel is just a useless proxy: it would be forced to do significant work to reason about what those requests do and when they should happen, but wouldn't be able to make those decisions itself so would have to just punt everything to userspace. Unless we have eBPF compositors.

Merged fences where one fence object contains multiple fences will be supported. A merged fence is signalled only when its fences are signalled. The consumer will have the option to redefine the unsignalled return fence to a merged fence.

An elaboration of how this differed from drm_syncobj would be really helpful here. I can make some guesses based on the rest of the mail, but I'm not sure how accurate they are.

2.2. Modesetting

Since a modesetting driver can also be the consumer, the present ioctl will contain a submit fence and a return fence too. One small problem with this is that userspace can hang the modesetting driver, but in theory, any later present ioctl can override the previous one, so the unsignalled presentation is never used.

This is also problematic. It's not just KMS, but media codecs too - V4L doesn't yet have explicit fencing, but given the programming model of codecs and how deeply they interoperate, it will.

Rather than client (GPU) -> compositor (GPU) -> compositor (KMS), imagine you're playing a Steam game on your Chromebook which you're streaming via Twitch or whatever. The full chain looks like:
* Steam game renders with GPU
* Xwayland in container receives dmabuf, forwards dmabuf to Wayland server (does not directly consume)
* Wayland server (which is actually Chromium) receives dmabuf, forwards dmabuf to Chromium UI process
* Chromium UI process forwards client dmabuf to KMS for direct scanout
* Chromium UI process _also_ forwards client dmabuf to GPU process
* Chromium GPU process composites Chromium UI + client dmabuf + webcam frame from V4L to GPU composition job
* Chromium GPU process forwards GPU composition dmabuf (not client dmabuf) to media codec for streaming

So, we don't have a 1:1 producer:consumer relationship. Even if we accept it's 1:n, your Chromebook is about to burst into flames and we're dropping frames to try to keep up. Some of the consumers are FIFO (the codec wants to push things through in order), and some of them are mailbox (the display wants to get the latest content, not from half a second ago before the other player started jumping around and now you're dead). You can't reason about any of these dependencies ahead of time from a producer PoV, because userspace will be making these decisions frame by frame. Also someone's started using the Vulkan present-timing extension because life wasn't confusing enough already.
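To make the FIFO/mailbox split concrete, here is a toy model in Python. It is only a sketch: the class and method names are made up for illustration and do not correspond to any kernel or compositor API. It shows that the same stream of frames is released at different times depending on the consumer's policy, which is why the producer cannot predict it up front.

```python
from collections import deque


class FifoConsumer:
    """Codec-style consumer (hypothetical): every frame is used, in order."""

    def __init__(self):
        self._queue = deque()

    def submit(self, frame):
        self._queue.append(frame)
        return []  # a FIFO consumer never drops a queued frame

    def acquire(self):
        return self._queue.popleft() if self._queue else None


class MailboxConsumer:
    """Display-style consumer (hypothetical): only the newest frame matters."""

    def __init__(self):
        self._slot = None

    def submit(self, frame):
        # Whatever was waiting in the slot is replaced and can be released
        # to the producer immediately, without ever having been shown.
        dropped = [] if self._slot is None else [self._slot]
        self._slot = frame
        return dropped

    def acquire(self):
        frame, self._slot = self._slot, None
        return frame


codec, display = FifoConsumer(), MailboxConsumer()
released_early = []
for frame in (1, 2, 3):
    codec.submit(frame)
    released_early += display.submit(frame)

first_encoded = codec.acquire()   # oldest frame
first_shown = display.acquire()   # newest frame
```

With three frames queued before either consumer wakes up, the codec still has to work through frame 1 while the display has already discarded frames 1 and 2 and goes straight to frame 3.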

As Christian and Daniel were getting at, there are also two 'levels' of explicit synchronisation.

The first (let's call it 'blind') is plumbing a dma_fence through to be passed with the dmabuf. When the client submits a buffer for presentation, it submits a dma_fence as well. When the compositor is finished with it (i.e. has flushed the last work which will source from that buffer), it passes a dma_fence back to the client, or no fence if required (buffer was never accessed, or all accesses are known to be fully retired e.g. the last fence accessing it has already signaled). This is just a matter of typing, and is supported by at least Weston. It implies no scheduling change over implicit fencing in that the compositor can be held hostage by abusive clients with a really long compute shader in their dependency chain: all that's happening is that we're plumbing those synchronisation tokens through userspace instead of having the kernel dig them up from dma_resv. But we at least have a no-deadlock guarantee, because a dma_fence will complete in bounded time.
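As a rough illustration of the 'blind' release path, the Python sketch below models the decision the compositor makes when handing a buffer back. Fence, MergedFence and release_fence_for are hypothetical names invented for this example, not real dma_fence or Weston interfaces, and combining several outstanding reads into one merged fence is an assumption about how the 'fence or no fence' result could be produced.

```python
class Fence:
    """Toy stand-in for a dma_fence: flushed work that will complete."""

    def __init__(self):
        self._signaled = False

    def signal(self):
        self._signaled = True

    @property
    def signaled(self):
        return self._signaled


class MergedFence:
    """Signaled only once every contained fence has signaled."""

    def __init__(self, fences):
        self._fences = list(fences)

    @property
    def signaled(self):
        return all(f.signaled for f in self._fences)


def release_fence_for(read_fences):
    """Pick what to send back to the client for one buffer.

    read_fences: fences of already-flushed compositor work reading the buffer.
    Returns None if the buffer was never accessed or all accesses have
    retired, otherwise a fence the client must wait on before reuse.
    """
    pending = [f for f in read_fences if not f.signaled]
    if not pending:
        return None
    return MergedFence(pending)
```

For example, release_fence_for([]) returns None straight away, while a buffer still being read by one unfinished job gets a fence that signals when that job does.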

The second (let's call it 'smart') is ... much more than that. Not only does the compositor accept and generate explicit synchronisation points for the client, but those synchronisation points aren't dma_fences, but may be wait-before-signal, or may be wait-never-signal. So in order to avoid a terminal deadlock, the compositor has to sit on every synchronisation point and check before it flushes any dependent work that it has signaled, or will at least signal in bounded time. If that guarantee isn't there, you have to punt and see if anything happens at your next repaint point. We don't currently have this support in any compositor, and it's a lot more work than blind.
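A minimal sketch of what 'sitting on every synchronisation point' could look like in a repaint loop, again in Python with invented names (PointState, SyncPoint, Surface and repaint are all hypothetical). The three-state model is an assumption: it treats a point with queued work as safe because that work completes in bounded time, and a point with no work behind it as one that may never signal.

```python
from enum import Enum


class PointState(Enum):
    UNSUBMITTED = 0  # wait-before-signal: nothing queued, may never signal
    SUBMITTED = 1    # work is queued, so it will signal in bounded time
    SIGNALED = 2


class SyncPoint:
    def __init__(self, state=PointState.UNSUBMITTED):
        self.state = state


class Surface:
    def __init__(self, name, current):
        self.name = name
        self.current = current  # content used by the last repaint
        self.pending = None     # (new content, SyncPoint) awaiting latch


def repaint(surfaces):
    """Build the scene, latching only commits that are safe to depend on."""
    scene = []
    for s in surfaces:
        if s.pending is not None:
            content, point = s.pending
            if point.state is not PointState.UNSUBMITTED:
                s.current, s.pending = content, None
            # Otherwise punt: keep the old content and look again at the
            # next repaint, which is why both sets of state must be kept.
        scene.append((s.name, s.current))
    return scene
```

The cost is visible even in this toy version: the compositor keeps the old and new state alive per surface and re-checks on every repaint instead of flushing dependent work and moving on.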

Given the interdependencies I've described above for Wayland - say a resize case, or when a surface commit triggers a cascade of subsurface commits - GPU-side conditional rendering is not always possible. In those cases, you _must_ do CPU-side waits and keep both sets of state around. Pain.

Typing all that out has convinced me that the current proposal is a net loss in every case.

Complex rendering uses (game engine with a billion draw calls, a billion BOs, complex sync dependencies, wait-before-signal and/or conditional rendering/descriptor indexing) don't need the complexity of a present ioctl and checking whether other processes have crashed or whatever. They already have everything plumbed through for this themselves, and need to implement so much infrastructure around it that they don't need much/any help from the kernel. Just give them a sync primitive with almost zero guarantees that they can map into CPU & GPU address space, let them go wild with it. drm_syncobj_plus_footgun. Good luck.

Simple presentation uses (desktop, browser, game) don't need the hyperoptimisation of sync primitives. Frame times are relatively long, and you can only have so many surfaces which aren't occluded. Either you have a complex scene to composite, in which case the CPU overhead of something like dma_fence is lower than the CPU overhead required to walk through a single compositor repaint cycle anyway, or you have a completely trivial scene to composite and you can absolutely eat the overhead of exporting and scheduling like two fences in 10ms.

Complex presentation uses (out-streaming, media sources, deeper presentation chains) make the trivial present ioctl so complex that its benefits evaporate. Wait-before-signal pushes so much complexity into the compositor that you have to eat a lot of CPU overhead there and lose your ability to do pipelined draws because you have to hang around and see if they'll ever complete. Cross-device usage means everyone just ends up spinning on the CPU instead.

So, can we take a step back? What are the problems we're trying to solve? If it's about optimising the game engine's internal rendering, how would that benefit from a present ioctl instead of current synchronisation?

If it's about composition, how do we balance the complexity between the kernel and userspace? What's the global benefit from throwing our hands in the air and saying 'you deal with it' to all of userspace, given that existing mailbox systems making frame-by-frame decisions already preclude deep/speculative pipelining on the client side?

Given that userspace then loses all ability to reason about presentation if wait-before-signal becomes a thing, do we end up with a global performance loss by replacing the overhead of kernel dma_fence handling with userspace spinning on a page? Even if we micro-optimise that by allowing userspace to be notified on access, is the overhead of pagefault -> kernel signal handler -> queue signalfd notification -> userspace event loop -> read page & compare to expected value, actually better than dma_fence?
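For a sense of what the last step of that chain amounts to, here is a small Python sketch of 'read page & compare to expected value'. The layout is an assumption made for illustration: a single little-endian 64-bit counter at offset 0 that only ever increases, with an anonymous mapping standing in for the shared page. The helper names are hypothetical.

```python
import mmap
import struct


def write_value(page, value):
    """Producer side: publish a new counter value at offset 0."""
    struct.pack_into("<Q", page, 0, value)


def point_reached(page, expected):
    """Consumer side: read the counter and compare to the expected value."""
    (current,) = struct.unpack_from("<Q", page, 0)
    return current >= expected


def spin_until(page, expected, max_polls):
    """Bounded busy-wait; returns the number of polls used, or None."""
    for polls in range(1, max_polls + 1):
        if point_reached(page, expected):
            return polls
    return None


# Anonymous mapping as a stand-in for the page shared between processes.
page = mmap.mmap(-1, mmap.PAGESIZE)
write_value(page, 5)
```

The compare itself is trivial. The open question in the text is everything around it: either the consumer burns CPU in something like spin_until, or it pays for a notification path to learn when the page is worth reading.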

Cheers,
Daniel 
_______________________________________________
dri-devel mailing list
dri-devel@xxxxxxxxxxxxxxxxxxxxx
https://lists.freedesktop.org/mailman/listinfo/dri-devel