Hey,

On Wed, 26 May 2021 at 17:53, Daniel Vetter <daniel@xxxxxxxx> wrote:
> On Wed, May 26, 2021 at 5:13 PM Daniel Stone <daniel@xxxxxxxxxxxxx> wrote:
> > > Shared is shared, I just meant to say that we always add the shared fence.
> > > So an explicit ioctl to add more shared fences is kinda pointless.
> > >
> > > So yeah on a good driver this will run in parallel. On a not-so-good
> > > driver (which currently includes amdgpu and panfrost) this will serialize,
> > > because those drivers don't have the concept of a non-exclusive fence for
> > > such shared buffers (amdgpu does not sync internally, but will sync as
> > > soon as it's cross-drm_file).
> >
> > When you say 'we always add the shared fence', add it to ... where?
> > And which shared fence? (I'm going to use 'fence' below to refer to
> > anything from literal sync_file to timeline-syncobj to userspace
> > fence.)
>
> In the current model, every time you submit anything to the gpu, we
> create a dma_fence to track this work. This dma_fence is attached as a
> shared fence to the dma_resv obj of every object in your working set.
> Clarifications
> you = both userspace or kernel, anything really, including fun stuff
> like writing PTEs, or clearing PTEs and then flushing TLBs
> working set = depends, but can be anything from "really just the
> buffers the current gpu submission uses" to "everything bound into a
> given gpu VM"
>
> This is the fence I'm talking about here.
>
> Since you can't escape this (not unless we do direct userspace submit
> with userspace memory fences) and since there's no distinction of the
> shared fences into "relevant for implicit sync" and "not relevant for
> implicit sync" there's really not much point in adding implicit read
> fences. For now at least, we might want to change this eventually.

Yeah, I agree.
My own clarification is that I'm talking about an explicit-first world,
where synchronisation is done primarily through unknowable UMF
(userspace memory fences), and falling back to implicit sync is a
painful and expensive operation that we only do when we need to. So,
definitely not on every CS (command submission aka execbuf aka
vkQueueSubmit aka glFlush).

> > I'll admit that I've typed out an argument twice for always export
> > from excl+shared, and always import to excl, results in oversync. And
> > I keep tying myself in knots trying to do it. It's arguably slightly
> > contrived, but here's my third attempt ...
> >
> > Vulkan Wayland client, full-flying-car-sync Wayland protocol,
> > Vulkan-based compositor. Part of the contract when the server exposes
> > that protocol is that it guarantees to do explicit sync in both
> > directions, so the client provides a fence at QueueSubmit time and the
> > server provides one back when releasing the image for return to ANI.
> > Neither side ever record fences into the dma_resv because they've
> > opted out by being fully explicit-aware.
> >
> > Now add media encode out on the side because you're streaming. The
> > compositor knows this is a transition between explicit and implicit
> > worlds, so it imports the client's fence into the exclusive dma_resv
> > slot, which makes sense: the media encode has to sync against the
> > client work, but is indifferent to the parallel compositor work. The
> > shared fence is exported back out so the compositor can union the
> > encode-finished fence with its composition-finished fence to send back
> > to the client with release/ANI.
> >
> > Now add a second media encode because you want a higher-quality local
> > capture to upload to YouTube later on. The compositor can do the exact
> > same import/export dance, and the two encodes can safely run in
> > parallel. Which is good.
>
> So the example which works is really clear ...
>
> > Where it starts to become complex is: what if your compositor is fully
> > explicit-aware but your clients aren't, so your compositor has more
> > import/export points to record into the resv? What if you aren't
> > actually a compositor but a full-blown media pipeline, where you have
> > a bunch of threads all launching reads in parallel, to the extent
> > where it's not practical to manage implicit/explicit transitions
> > globally, but each thread has to more pessimistically import and
> > export around each access?
>
> ... but the example where we oversync is hand-waving?
>
> :-P

Hey, I said I tied myself into knots! Maybe it's because my brain is
too deeply baked into implicit sync, maybe it's because the problem
cases aren't actually problems. Who knows. I think what it comes down
to is that we make it workable for (at least current-generation, before
someone bakes it into Unity) Wayland compositors to work well with
these modal switches, but really difficult for more complex and
variable pipeline frameworks like GStreamer or PipeWire to work with
it.

> > I can make the relatively simple usecases work, but it really feels
> > like in practice we'll end up with massive oversync in some fairly
> > complex usecases, and we'll regret not having had it from the start,
> > plus people will just rely on implicit sync for longer because it has
> > better (more parallel) semantics in some usecases.
>
> Things fall apart in implicit sync if you have more than one logical
> writer into the same buffer. Trivial example is two images in one
> buffer, but you could also do funky stuff like interleaved/tiled
> rendering with _independent_ consumers. If the consumers are not
> independent, then you can again just stuff the two writer fences into
> the exclusive slot with the new ioctl (they'll get merged without
> additional overhead into one fence array fence).
>
> And the fundamental thing is: This is just not possible with implicit
> sync.
> There's only one fence slot (even if that resolves to an array
> of fences for all the producers), so anytime you do multiple
> independent things in the same buffer you either:
> - must split the buffers so there's again a clear&unique handoff at
>   each stage of the pipeline
> - or use explicit sync

Yeah no argument, this doesn't work & can't work in implicit sync. But
what I'm talking about is having a single writer (serialised) and
multiple readers (in parallel). Readers add to the shared slot,
serialising their access against all earlier exclusive fences, and
writers add to the exclusive slot, serialising their access against all
earlier fences (both exclusive and shared).

So if import can only add to the exclusive slot, then we can end up
potentially serialising readers against each other. We want readers to
land in the shared slot to be able to parallelise against each other
and let writers serialise after them, no?

> So in your example, options are
> - per-client buffers, which you then blend into a composite buffer to
>   handle the N implicit fences from N buffers into a single implicit
>   fence for libva conversion. This single buffer then also allows you to
>   again fan out to M libva encoders, or whatever it is that you fancy
> - explicit fencing and clients render into a single buffer with no
>   copying, and libva encodes from that single buffer (but again needs
>   explicit fences or it all comes crashing down)
>
> There's really no option C where you somehow do multiple implicitly
> fenced things into a single buffer and expect it to work out in
> parallel.

All of my examples above are a single client buffer (GPU source which
places a fence into the exclusive slot for when the colour buffer
contents are fully realised), just working its way through multiple
stages and APIs.
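
To make the single-writer, multiple-reader rule above concrete, here is
a toy Python model. It is only a sketch, not kernel code: `Resv`,
`add_reader`, `add_writer`, `import_fence` and `scenario` are names made
up for illustration, fences are plain strings, and merging an imported
fence into the existing exclusive set is an assumption about how an
exclusive-only import would behave.

```python
from dataclasses import dataclass, field


@dataclass
class Resv:
    """Toy stand-in for a dma_resv: an exclusive set plus shared fences."""
    exclusive: frozenset = frozenset()
    shared: list = field(default_factory=list)


def add_reader(resv, name):
    """Implicit reader: waits on the exclusive slot only, joins shared."""
    waits = set(resv.exclusive)
    resv.shared.append(name)
    return waits


def add_writer(resv, name):
    """Implicit writer: waits on every earlier fence, takes exclusive."""
    waits = set(resv.exclusive) | set(resv.shared)
    resv.exclusive = frozenset({name})
    resv.shared.clear()
    return waits


def import_fence(resv, name, exclusive):
    """Hypothetical explicit -> implicit import of an already-running job.

    ASSUMPTION: an exclusive import merges with the current exclusive
    fences (like a fence array) instead of replacing them.
    """
    if exclusive:
        resv.exclusive = resv.exclusive | {name}
    else:
        resv.shared.append(name)


def scenario(import_as_exclusive):
    """One client buffer, one explicit reader, one implicit reader."""
    resv = Resv(exclusive=frozenset({"client_render"}))
    # encode_a only reads the buffer, but was synchronised explicitly,
    # so its fence has to be imported for implicit users to see it.
    import_fence(resv, "encode_a", exclusive=import_as_exclusive)
    reader_waits = add_reader(resv, "encode_b")
    writer_waits = add_writer(resv, "client_next_frame")
    return reader_waits, writer_waits
```

In this model, `scenario(True)` makes `encode_b` wait on both
`client_render` and `encode_a`, so the two readers serialise, while
`scenario(False)` makes it wait on `client_render` alone. The later
writer waits on all three fences either way.
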
Like, your single Dota2 window ends up in a Vulkan-based Wayland
compositor, a pure VA-API encode stream to write high-quality AV1 to
disk, and also an EGL pipeline which overlays your awesome logo and
webcam stream before VA-API encoding to a lower-quality H.264 stream
for Twitch. This isn't a convoluted example, it's literally what the
non-geriatric millennials do all day. It's a lot of potential
boundaries between the implicit & explicit worlds, and if we've learned
one thing from modifiers it's that we probably shouldn't underthink the
boundaries.

So:

1. Does every CS generate the appropriate resv entries (exclusive for
   write, shared for read) for every access to every buffer? I think
   the answer has to be no, because it's not necessarily viable in
   future.

2. If every CS doesn't generate the appropriate resv entries, do we go
   for the middle ground where we keep interactions with implicit sync
   implicit (e.g. every client API accessing any externally-visible BO
   populates the appropriate resv slot, but internal-only buffers get
   to skip it), or do we surface them and make it explicit (e.g. the
   Wayland explicit-sync protocol is a contract between
   client/compositor that the client doesn't have to populate the resv
   slots, because the compositor will ensure every access it makes is
   appropriately synchronised)? I think the latter, because the halfway
   house sounds really painful for questionable if any benefit, and
   makes it maybe impossible for us to one day deprecate implicit.

3. If we do surface everything and make userspace handle the
   implicit/explicit boundaries, do we make every explicit -> implicit
   boundary (via the import ioctl) populate the exclusive slot or allow
   it to choose? I think allow it to choose, because I don't understand
   what the restriction buys us.

4. Can the combination of dynamic modifier negotiation and explicit
   synchronisation let us deliver the EGLStreams promise before
   EGLStreams can? :)

Cheers,
Daniel