Re: Plumbing explicit synchronization through the Linux ecosystem

Nicolas Dufresne <nicolas@xxxxxxxxxxxx> · Tue, 17 Mar 2020 14:21:38 -0400

Le mardi 17 mars 2020 à 11:27 -0500, Jason Ekstrand a écrit :
> On Tue, Mar 17, 2020 at 10:33 AM Nicolas Dufresne <nicolas@xxxxxxxxxxxx> wrote:
> > Le lundi 16 mars 2020 à 23:15 +0200, Laurent Pinchart a écrit :
> > > Hi Jason,
> > > 
> > > On Mon, Mar 16, 2020 at 10:06:07AM -0500, Jason Ekstrand wrote:
> > > > On Mon, Mar 16, 2020 at 5:20 AM Laurent Pinchart wrote:
> > > > > On Wed, Mar 11, 2020 at 04:18:55PM -0400, Nicolas Dufresne wrote:
> > > > > > (I know I'm going to be spammed by so many mailing list ...)
> > > > > > 
> > > > > > Le mercredi 11 mars 2020 à 14:21 -0500, Jason Ekstrand a écrit :
> > > > > > > On Wed, Mar 11, 2020 at 12:31 PM Jason Ekstrand <jason@xxxxxxxxxxxxxx> wrote:
> > > > > > > > All,
> > > > > > > > 
> > > > > > > > Sorry for casting such a broad net with this one. I'm sure most people
> > > > > > > > who reply will get at least one mailing list rejection.  However, this
> > > > > > > > is an issue that affects a LOT of components and that's why it's
> > > > > > > > thorny to begin with.  Please pardon the length of this e-mail as
> > > > > > > > well; I promise there's a concrete point/proposal at the end.
> > > > > > > > 
> > > > > > > > 
> > > > > > > > Explicit synchronization is the future of graphics and media.  At
> > > > > > > > least, that seems to be the consensus among all the graphics people
> > > > > > > > I've talked to.  I had a chat with one of the lead Android graphics
> > > > > > > > engineers recently who told me that doing explicit sync from the start
> > > > > > > > was one of the best engineering decisions Android ever made.  It's
> > > > > > > > also the direction being taken by more modern APIs such as Vulkan.
> > > > > > > > 
> > > > > > > > 
> > > > > > > > ## What are implicit and explicit synchronization?
> > > > > > > > 
> > > > > > > > For those that aren't familiar with this space, GPUs, media encoders,
> > > > > > > > etc. are massively parallel and synchronization of some form is
> > > > > > > > required to ensure that everything happens in the right order and
> > > > > > > > avoid data races.  Implicit synchronization is when bits of work (3D,
> > > > > > > > compute, video encode, etc.) are implicitly based on the absolute
> > > > > > > > CPU-time order in which API calls occur.  Explicit synchronization is
> > > > > > > > when the client (whatever that means in any given context) provides
> > > > > > > > the dependency graph explicitly via some sort of synchronization
> > > > > > > > primitives.  If you're still confused, consider the following
> > > > > > > > examples:
> > > > > > > > 
> > > > > > > > With OpenGL and EGL, almost everything is implicit sync.  Say you have
> > > > > > > > two OpenGL contexts sharing an image where one writes to it and the
> > > > > > > > other textures from it.  The way the OpenGL spec works, the client has
> > > > > > > > to make the API calls to render to the image before (in CPU time) it
> > > > > > > > makes the API calls which texture from the image.  As long as it does
> > > > > > > > this (and maybe inserts a glFlush?), the driver will ensure that the
> > > > > > > > rendering completes before the texturing happens and you get correct
> > > > > > > > contents.
> > > > > > > > 
> > > > > > > > Implicit synchronization can also happen across processes.  Wayland,
> > > > > > > > for instance, is currently built on implicit sync where the client
> > > > > > > > does their rendering and then does a hand-off (via wl_surface::commit)
> > > > > > > > to tell the compositor it's done at which point the compositor can now
> > > > > > > > texture from the surface.  The hand-off ensures that the client's
> > > > > > > > OpenGL API calls happen before the server's OpenGL API calls.
> > > > > > > > 
> > > > > > > > A good example of explicit synchronization is the Vulkan API.  There,
> > > > > > > > a client (or multiple clients) can simultaneously build command
> > > > > > > > buffers in different threads where one of those command buffers
> > > > > > > > renders to an image and the other textures from it and then submit
> > > > > > > > both of them at the same time with instructions to the driver for
> > > > > > > > which order to execute them in.  The execution order is described via
> > > > > > > > the VkSemaphore primitive.  With the new VK_KHR_timeline_semaphore
> > > > > > > > extension, you can even submit the work which does the texturing
> > > > > > > > BEFORE the work which does the rendering and the driver will sort it
> > > > > > > > out.
> > > > > > > > 
> > > > > > > > The #1 problem with implicit synchronization (which explicit solves)
> > > > > > > > is that it leads to a lot of over-synchronization both in client space
> > > > > > > > and in driver/device space.  The client has to synchronize a lot more
> > > > > > > > because it has to ensure that the API calls happen in a particular
> > > > > > > > order.  The driver/device have to synchronize a lot more because they
> > > > > > > > never know what is going to end up being a synchronization point as an
> > > > > > > > API call on another thread/process may occur at any time.  As we move
> > > > > > > > to more and more multi-threaded programming this synchronization (on
> > > > > > > > the client-side especially) becomes more and more painful.
> > > > > > > > 
> > > > > > > > 
> > > > > > > > ## Current status in Linux
> > > > > > > > 
> > > > > > > > Implicit synchronization in Linux works via a the kernel's internal
> > > > > > > > dma_buf and dma_fence data structures.  A dma_fence is a tiny object
> > > > > > > > which represents the "done" status for some bit of work.  Typically,
> > > > > > > > dma_fences are created as a by-product of someone submitting some bit
> > > > > > > > of work (say, 3D rendering) to the kernel.  The dma_buf object has a
> > > > > > > > set of dma_fences on it representing shared (read) and exclusive
> > > > > > > > (write) access to the object.  When work is submitted which, for
> > > > > > > > instance renders to the dma_buf, it's queued waiting on all the fences
> > > > > > > > on the dma_buf and and a dma_fence is created representing the end of
> > > > > > > > said rendering work and it's installed as the dma_buf's exclusive
> > > > > > > > fence.  This way, the kernel can manage all its internal queues (3D
> > > > > > > > rendering, display, video encode, etc.) and know which things to
> > > > > > > > submit in what order.
> > > > > > > > 
> > > > > > > > For the last few years, we've had sync_file in the kernel and it's
> > > > > > > > plumbed into some drivers.  A sync_file is just a wrapper around a
> > > > > > > > single dma_fence.  A sync_file is typically created as a by-product of
> > > > > > > > submitting work (3D, compute, etc.) to the kernel and is signaled when
> > > > > > > > that work completes.  When a sync_file is created, it is guaranteed by
> > > > > > > > the kernel that it will become signaled in finite time and, once it's
> > > > > > > > signaled, it remains signaled for the rest of time.  A sync_file is
> > > > > > > > represented in UAPIs as a file descriptor and can be used with normal
> > > > > > > > file APIs such as dup().  It can be passed into another UAPI which
> > > > > > > > does some bit of queue'd work and the submitted work will wait for the
> > > > > > > > sync_file to be triggered before executing.  A sync_file also supports
> > > > > > > > poll() if  you want to wait on it manually.
> > > > > > > > 
> > > > > > > > Unfortunately, sync_file is not broadly used and not all kernel GPU
> > > > > > > > drivers support it.  Here's a very quick overview of my understanding
> > > > > > > > of the status of various components (I don't know the status of
> > > > > > > > anything in the media world):
> > > > > > > > 
> > > > > > > >  - Vulkan: Explicit synchronization all the way but we have to go
> > > > > > > > implicit as soon as we interact with a window-system.  Vulkan has APIs
> > > > > > > > to import/export sync_files to/from it's VkSemaphore and VkFence
> > > > > > > > synchronization primitives.
> > > > > > > >  - OpenGL: Implicit all the way.  There are some EGL extensions to
> > > > > > > > enable some forms of explicit sync via sync_file but OpenGL itself is
> > > > > > > > still implicit.
> > > > > > > >  - Wayland: Currently depends on implicit sync in the kernel (accessed
> > > > > > > > via EGL/OpenGL).  There is an unstable extension to allow passing
> > > > > > > > sync_files around but it's questionable how useful it is right now
> > > > > > > > (more on that later).
> > > > > > > >  - X11: With present, it has these "explicit" fence objects but
> > > > > > > > they're always a shmfence which lets the X server and client do a
> > > > > > > > userspace CPU-side hand-off without going over the socket (and
> > > > > > > > round-tripping through the kernel).  However, the only thing that
> > > > > > > > fence does is order the OpenGL API calls in the client and server and
> > > > > > > > the real synchronization is still implicit.
> > > > > > > >  - linux/i915/gem: Fully supports using sync_file or syncobj for explicit
> > > > > > > > sync.
> > > > > > > >  - linux/amdgpu: Supports sync_file and syncobj but it still
> > > > > > > > implicitly syncs sometimes due to it's internal memory residency
> > > > > > > > handling which can lead to over-synchronization.
> > > > > > > >  - KMS: Implicit sync all the way.  There are no KMS APIs which take
> > > > > > > > explicit sync primitives.
> > > > > > > 
> > > > > > > Correction:  Apparently, I missed some things.  If you use atomic, KMS
> > > > > > > does have explicit in- and out-fences.  Non-atomic users (e.g. X11)
> > > > > > > are still in trouble but most Wayland compositors use atomic these
> > > > > > > days
> > > > > > > 
> > > > > > > >  - v4l: ???
> > > > > > > >  - gstreamer: ???
> > > > > > > >  - Media APIs such as vaapi etc.:  ???
> > > > > > 
> > > > > > GStreamer is consumer for V4L2, VAAPI and other stuff. Using asynchronous buffer
> > > > > > synchronisation is something we do already with GL (even if limited). We place
> > > > > > GLSync object in the pipeline and attach that on related GstBuffer. We wait on
> > > > > > these GLSync as late as possible (or superseed the sync if we queue more work
> > > > > > into the same GL context). That requires a special mode of operation of course.
> > > > > > We don't usually like making lazy blocking call implicit, as it tends to cause
> > > > > > random issues. If we need to wait, we think it's better to wait int he module
> > > > > > that is responsible, so in general, we try to negotiate and fallback locally
> > > > > > (it's plugin base, so this can be really messy otherwise).
> > > > > > 
> > > > > > So basically this problem needs to be solved in V4L2, VAAPI and other lower
> > > > > > level APIs first. We need API that provides us these fence (in or out), and then
> > > > > > we can consider using them. For V4L2, there was an attempt, but it was a bit of
> > > > > > a miss-fit. Your proposal could work, need to be tested I guess, but it does not
> > > > > > solve some of other issues that was discussed. Notably for camera capture, were
> > > > > > the HW timestamp is capture about at the same time the frame is ready. But the
> > > > > > timestamp is not part of the paylaod, so you need an entire API asynchronously
> > > > > > deliver that metadata. It's the biggest pain point I've found, such an API would
> > > > > > be quite invasive or if made really generic, might just never be adopted widely
> > > > > > enough.
> > > > > 
> > > > > Another issue is that V4L2 doesn't offer any guarantee on job ordering.
> > > > > When you queue multiple buffers for camera capture for instance, you
> > > > > don't know until capture complete in which buffer the frame has been
> > > > > captured.
> > > > 
> > > > Is this a Kernel UAPI issue?  Surely the kernel driver knows at the
> > > > start of frame capture which buffer it's getting written into.  I
> > > > would think that the kernel APIs could be adjusted (if we find good
> > > > reason to do so!) such that they return earlier and return a (buffer,
> > > > fence) pair.  Am I missing something fundamental about video here?
> > > 
> > > For cameras I believe we could do that, yes. I was pointing out the
> > > issues caused by the current API. For video decoders I'll let Nicolas
> > > answer the question, he's way more knowledgeable that I am on that
> > > topic.
> > 
> > Right now, there is simply no uAPI for supporting asynchronous errors
> > reporting when fences are invovled. That is true for both camera's and
> > CODEC. It's likely what all the attempt was missing, I don't know
> > enough myself to suggest something.
> > 
> > Now, why Stateless video decoders are special is another subject. In
> > CODECs, the decoding and the presentation order may differ. For
> > Stateless kind of CODEC, a bitstream is passed to the HW. We don't know
> > if this bitstream is fully valid, since the it is being parsed and
> > validated by the firmware. It's also firmware job to decide which
> > buffer should be presented first.
> > 
> > In most firmware interface, that information is communicated back all
> > at once when the frame is ready to be presented (which may be quite
> > some time after it was decoded). So indeed, a fence model is not really
> > easy to add, unless the firmware was designed with that model in mind.
> 
> Just to be clear, I think we should do whatever makes sense here and
> not try to slam sync_file in when it doesn't make sense just because
> we have it.  The more I read on this thread, the less out-fences from
> video decode sound like they make sense unless we have a really solid
> plan for async error reporting.  It's possible, depending on how many
> processes are involved in the pipeline, that async error reporting
> could help reduce latency a bit if it let the kernel report the error
> directly to the last process in the chain.  However, I'm not convinced
> the potential for userspace programmer error is worth it..  That said,
> I'm happy to leave that up to the actual video experts. (I just do 3D)
> 
> > Nothing of course would prevent V4L2 framework to generically handle
> > out_fence from other producers. It does not even handle implicit fences
> > at the moment, which is already quite problematic (I've seen glitches
> > on i.MX6/8 and Raspberry Pi 4).
> > 
> > In that specific case, if the fences from etnaviv, vc graphic drivers
> > was exposed, we could solve this issue in userspace. Right now it's
> > implicit, so we rely on all DMABuf driver to have proper support, which
> > is not the case. There is V4L2 support for that coming, but the wait is
> > done synchronously in userspace call that was normally non-blocking. So
> > that is unlikely to fly.
> 
> Yeah... waits in userspace aren't what anyone wants.
> 
> > Small note, stateless video decoders don't have this issue. The
> > bitstream is validated by userspace, and userspace controls the
> > "decode" operation. This one would be a good case for bidirectional
> > fencing.
> 
> Good to know.
> 
> > > > I must admit that V4L is a bit of an odd case since the kernel driver
> > > > is the producer and not the consumer.
> > > 
> > > Note that V4L2 can be a consumer too. Video output with V4L2 is less
> > > frequent than video capture (but it still exists), and codecs and other
> > > memory-to-memory processing devices (colorspace converters, scalers,
> > > ...) are both consumers and producers.
> > > 
> > > > > In the normal case buffers are processed in sequence, but if
> > > > > an error occurs during capture, they can be recycled internally and put
> > > > > to the back of the queue.
> > > > 
> > > > Are those errors something that can happen at any time in the middle
> > > > of a frame capture?  If so, that does make things stickier.
> > > 
> > > Yes it can. Think of packet loss when capturing from a USB webcam for
> > > instance.
> > > 
> > > > > Unless I'm mistaken, this problem also exists
> > > > > with stateful codecs. And if you don't know in advance which buffer you
> > > > > will receive from the device, the usefulness of fences becomes very
> > > > > questionable :-)
> > > > 
> > > > Yeah, if you really are in a situation where there's no way to know
> > > > until the full frame capture has been completed which buffer is next,
> > > > then fences are useless.  You aren't in an implicit synchronization
> > > > setting either; you're in a "full flush" setting.  It's arguably worse
> > > > for performance but perhaps unavoidable?
> > > 
> > > Probably unavoidable in some cases, but nothing that should get in the
> > > way for the discussion at hand: there's no need to migrate away from
> > > implicit sync when there's implicit sync in the first place :-)
> > > 
> > > I think we need to analyse the use cases here, and figure out at least
> > > guidelines for userspace, otherwise applications will wonder what
> > > behaviour to implement, and we'll end up with a wide variety of them.
> > > Even just on the kernel side, some V4L2 capture driver will pass
> > > erroneous frames to userspace (thus guaranteeing ordering, but without
> > > early notification of errors), some will require the frame
> > > automatically, and at least one (uvcvideo) has a module parameter to
> > > pick the desired behaviour.
> > 
> > Also, from a userspace point of view, the synchronization with the
> > "next frame" in V4L2 isn't implicit. We can poll() the device, just
> > like we'd do with a fence FD. What the explicit fence gives, is a
> > unified object we can pass to another driver, or other userspace, so we
> > can delegate the wait.
> > 
> > You refer to performance in few places. In streaming, this is often
> > measure as real-time throughput. Implicit/explicit fences don't really
> > play any role for us in this regard. V4L2 drivers, like m2m drivers,
> > works with buffer queues. So you can queue in advance many buffers on
> > the OUTPUT device side (which is the input of the m2m), and userspace
> > will queue in advance pretty much all free buffers available on the
> > CAPTURE side. The driver is never starved in that model, at the cost of
> > very large memory consumption of course. Maybe a more visual
> > representation would be:
> > 
> >   [pending job] -> [M2M Worker] -> [pending results]
> > 
> > So as long as userspace keep the pending job queue non-empty, and that
> > it consumes and give back buffers back to write the results into, the
> > driver will keep running un-interrupted. Performance remains optimal.
> > What isn't optimal is the latency. And what bugs right now is when a
> > DMAbuf implicit out fence is put back into the pending results queue,
> > since the fence is ignored.
> 
> Yes, that makes sense.  In 3D land, we're very concerned about
> latency.  Any time anyone has to stall for anything, it's a potential
> hitch in someone's game.  Being delayed by a single extra frame can be
> problematic; 2-3 frames puts the gamer at a significant disadvantage.
> In video, as long as audio and video are in sync and you aren't
> dropping frames, no one really cares about latency as long as hitting
> the pause button doesn't take too long.

Just a note, there exist low latency use cases for streaming too (sub-
frame latency between two devices). But everything I'm ware is
downstream. The one I have in mind uses a special AXI feature to
synchronize between two HW component, but the implementation is not
using either implicit or explicit fence, in fact they didn't bother
adding a specific kernel object, you have to know when you use these
downstream drivers. We are a bit far from being able to make generic
software on top of that.

The use case was less prone to capture error, since instead of a
camera, they have SDI or HDMI receiver.

> 
> What concerns me the most, I think is actually the interop issues.
> You mentioned issues with the raspberry pi.  Right now, if someone is
> rendering frames using a Vulkan driver and trying to pass those on to
> V4L for encode or to some other api such as VA-API, we don't really
> have a plan for synchronization.  Thanks to dma-buf extensions we at
> least have most of a plan for sharing the memory and negotiating image
> layouts (strides, tiling, etc.) but no plan for synchronization at

I didn't know there was plan for that, this is nice. Right now every
userspace carry this information in a slightly different and
incompatible way, translating, extrapolation, etc. It's all very error
prone.

> all.  The only thing you can do today is to use a VkFence to CPU wait
> for the 3D rendering to be 100% done and then pass the image on to the
> encoder.
> 
> The more I look over the various hacks we've done over the course of
> the last 4 years to make window systems work, the less confident I am
> that I want to expose ANY of them as an official Vulkan extension that
> we support long-term.  The one we do have which I'm reasonably happy
> to be stuck with is sync_file import/export.  That said, it's sounding
> like V4L doesn't support dma-buf implicit sync at all so maybe CPU
> waiting with a VkFence is the current state-of-the-art?
> 
> --Jason
> 
> 
> > > > Trying to understand. :-)
> > > 
> > > So am I :-)
> > 
> > Hehe, same here.
> > 
> > > > > > There is other elements that would implement fencing, notably kmssink, but no
> > > > > > one actually dared porting it to atomic KMS, so clearly there is very little
> > > > > > comunity interest. glimagsink could clearly benifit. Right now if we import a
> > > > > > DMABuf, and that this DMAbuf is used for render, a implicit fence is attached,
> > > > > > which we are unaware. Philippe Zabbel is working on a patch, so V4L2 QBUF would
> > > > > > wait, but waiting in QBUF is not allowed if O_NONBLOCK was set (which GStreamer
> > > > > > uses), so then the operation will just fail where it worked before (breaking
> > > > > > userspace). If it was an explcit fence, we could handle that in GStreamer
> > > > > > cleanly as we do for new APIs.
> > > > > > 
> > > > > > > > ## Chicken and egg problems
> > > > > > > > 
> > > > > > > > Ok, this is where it starts getting depressing.  I made the claim
> > > > > > > > above that Wayland has an explicit synchronization protocol that's of
> > > > > > > > questionable usefulness.  I would claim that basically any bit of
> > > > > > > > plumbing we do through window systems is currently of questionable
> > > > > > > > usefulness.  Why?
> > > > > > > > 
> > > > > > > > From my perspective, as a Vulkan driver developer, I have to deal with
> > > > > > > > the fact that Vulkan is an explicit sync API but Wayland and X11
> > > > > > > > aren't.  Unfortunately, the Wayland extension solves zero problems for
> > > > > > > > me because I can't really use it unless it's implemented in all of the
> > > > > > > > compositors.  Until every Wayland compositor I care about my users
> > > > > > > > being able to use (which is basically all of them) supports the
> > > > > > > > extension, I have to continue carry around my pile of hacks to keep
> > > > > > > > implicit sync and Vulkan working nicely together.
> > > > > > > > 
> > > > > > > > From the perspective of a Wayland compositor (I used to play in this
> > > > > > > > space), they'd love to implement the new explicit sync extension but
> > > > > > > > can't.  Sure, they could wire up the extension, but the moment they go
> > > > > > > > to flip a client buffer to the screen directly, they discover that KMS
> > > > > > > > doesn't support any explicit sync APIs.
> > > > > > > 
> > > > > > > As per the above correction, Wayland compositors aren't nearly as bad
> > > > > > > off as I initially thought.  There may still be weird screen capture
> > > > > > > cases but the normal cases of compositing and displaying via
> > > > > > > KMS/atomic should be in reasonably good shape.
> > > > > > > 
> > > > > > > > So, yes, they can technically
> > > > > > > > implement the extension assuming the EGL stack they're running on has
> > > > > > > > the sync_file extensions but any client buffers which come in using
> > > > > > > > the explicit sync Wayland extension have to be composited and can't be
> > > > > > > > scanned out directly.  As a 3D driver developer, I absolutely don't
> > > > > > > > want compositors doing that because my users will complain about
> > > > > > > > performance issues due to the extra blit.
> > > > > > > > 
> > > > > > > > Ok, so let's say we get KMS wired up with implicit sync.  That solves
> > > > > > > > all our problems, right?  It does, right up until someone decides that
> > > > > > > > they wan to screen capture their Wayland session via some hardware
> > > > > > > > media encoder that doesn't support explicit sync.  Now we have to
> > > > > > > > plumb it all the way through the media stack, gstreamer, etc.  Great,
> > > > > > > > so let's do that!  Oh, but gstreamer won't want to plumb it through
> > > > > > > > until they're guaranteed that they can use explicit sync when
> > > > > > > > displaying on X11 or Wayland.  Are you seeing the problem?
> > > > > > > > 
> > > > > > > > To make matters worse, since most things are doing implicit
> > > > > > > > synchronization today, it's really easy to get your explicit
> > > > > > > > synchronization wrong and never notice.  If you forget to pass a
> > > > > > > > sync_file into one place (say you never notice KMS doesn't support
> > > > > > > > them), it will probably work anyway thanks to all the implicit sync
> > > > > > > > that's going on elsewhere.
> > > > > > > > 
> > > > > > > > So, clearly, we all need to go write piles of code that we can't
> > > > > > > > actually properly test until everyone else has written their piece and
> > > > > > > > then we use explicit sync if and only if all components support it.
> > > > > > > > Really?  We're going to do multiple years of development and then just
> > > > > > > > hope it works when we finally flip the switch?  That doesn't sound
> > > > > > > > like a good plan to me.
> > > > > > > > 
> > > > > > > > 
> > > > > > > > ## A proposal: Implicit and explicit sync together
> > > > > > > > 
> > > > > > > > How to solve all these chicken-and-egg problems is something I've been
> > > > > > > > giving quite a bit of thought (and talking with many others about) in
> > > > > > > > the last couple of years.  One motivation for this is that we have to
> > > > > > > > deal with a mismatch in Vulkan.  Another motivation is that I'm
> > > > > > > > becoming increasingly unhappy with the way that synchronization,
> > > > > > > > memory residency, and command submission are inherently intertwined in
> > > > > > > > i915 and would like to break things apart.  Towards that end, I have
> > > > > > > > an actual proposal.
> > > > > > > > 
> > > > > > > > A couple weeks ago, I sent a series of patches to the dri-devel
> > > > > > > > mailing list which adds a pair of new ioctls to dma-buf which allow
> > > > > > > > userspace to manually import or export a sync_file from a dma-buf.
> > > > > > > > The idea is that something like a Wayland compositor can switch to
> > > > > > > > 100% explicit sync internally once the ioctl is available.  If it gets
> > > > > > > > buffers in from a client that doesn't use the explicit sync extension,
> > > > > > > > it can pull a sync_file from the dma-buf and use that exactly as it
> > > > > > > > would a sync_file passed via the explicit sync extension.  When it
> > > > > > > > goes to scan out a user buffer and discovers that KMS doesn't accept
> > > > > > > > sync_files (or if it tries to use that pesky media encoder no one has
> > > > > > > > converted), it can take it's sync_file for display and stuff it into
> > > > > > > > the dma-buf before handing it to KMS.
> > > > > > > > 
> > > > > > > > Along with the kernel patches, I've also implemented support for this
> > > > > > > > in the Vulkan WSI code used by ANV and RADV.  With those patches, the
> > > > > > > > only requirement on the Vulkan drivers is that you be able to export
> > > > > > > > any VkSemaphore as a sync_file and temporarily import a sync_file into
> > > > > > > > any VkFence or VkSemaphore.  As long as that works, the core Vulkan
> > > > > > > > driver only ever sees explicit synchronization via sync_file.  The WSI
> > > > > > > > code uses these new ioctls to translate the implicit sync of X11 and
> > > > > > > > Wayland to the explicit sync the Vulkan driver wants.
> > > > > > > > 
> > > > > > > > I'm hoping (and here's where I want a sanity check) that a simple API
> > > > > > > > like this will allow us to finally start moving the Linux ecosystem
> > > > > > > > over to explicit synchronization one piece at a time in a way that's
> > > > > > > > actually correct.  (No Wayland explicit sync with compositors hoping
> > > > > > > > KMS magically works even though it doesn't have a sync_file API.)
> > > > > > > > Once some pieces in the ecosystem start moving, there will be
> > > > > > > > motivation to start moving others and maybe we can actually build the
> > > > > > > > momentum to get most everything converted.
> > > > > > > > 
> > > > > > > > For reference, you can find the kernel RFC patches and mesa MR here:
> > > > > > > > 
> > > > > > > > https://lists.freedesktop.org/archives/dri-devel/2020-March/258833.html
> > > > > > > > 
> > > > > > > > https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/4037
> > > > > > > > 
> > > > > > > > At this point, I welcome your thoughts, comments, objections, and
> > > > > > > > maybe even help/review. :-)