Re: [PATCH 4/4] RFC: dma-buf: Add an API for importing sync files (v6)

Jason Ekstrand <jason@xxxxxxxxxxxxxx> · Wed, 26 May 2021 10:24:05 -0500

On Wed, May 26, 2021 at 6:09 AM Daniel Stone <daniel@xxxxxxxxxxxxx> wrote:
> On Mon, 24 May 2021 at 18:11, Jason Ekstrand <jason@xxxxxxxxxxxxxx> wrote:
> >  3. Userspace memory fences.
> >
> > Note that timeline syncobj is NOT in that list.  IMO, all the "wait
> > for submit" stuff is an implementation detail we needed in order to
> > get the timeline semantics on top of immutable SW fences.  Under the
> > hood it's all dma_fence; this just gives us a shareable container so
> > we can implement VK_KHR_timeline_semaphore with sharing.  I really
> > don't want to make Wayland protocol around it if memory fences are the
> > final solution.
>
> Typing out the Wayland protocol isn't the hard bit. If we just need to
> copy and sed syncobj to weirdsyncobj, no problem really, and it gives
> us a six-month head start on painful compositor-internal surgery
> whilst we work on common infrastructure to ship userspace fences
> around (mappable dmabuf with the sync bracketing? FD where every
> read() gives you the current value? memfd? other?).

I feel like I should elaborate more about timelines.  In my earlier
reply, my commentary about timeline syncobj was mostly focused around
helping people avoid typing.  That's not really the full story,
though, and I hope more context will help.

First, let me say that timeline syncobj was designed as a mechanism to
implement VK_KHR_timeline_semaphore without inserting future fences
into the kernel.  It's entirely designed around the needs of Vulkan
drivers, not really as a window-system primitive.  The semantics are
designed around one driver communicating to another that new fences
have been added and it's safe to kick off more rendering.  I'm not
convinced that it's the right object for window-systems and I'm also
not convinced that it's a good idea to try and make a version of it
that's a wrapper around a userspace memory fence.  (I'm going to start
typing UMF for userspace memory fence because it's long to type out.)

Why?  Well, the fundamental problem with timelines in general is
trying to figure out when it's about to be done.  But timeline syncobj
solves this for us!  It gives us this fancy super-useful ioctl!
Right?  Uh.... not as well as I'd like.  Let's say we make a timeline
syncobj that's a wrapper around a userspace memory fence.  What do we
do with that ioctl?  As I mentioned above, the kernel doesn't have any
clue when it will be triggered so that ioctl turns into an actual
wait.  That's no good because it creates unnecessary stalls.

There's another potential solution here:  Have each UMF be two
timelines: submitted and completed.  At the start of every batch
that's supposed to trigger a UMF, we set the "submitted" side and
then, when it completes, we set the "completed" side.  Ok, great, now
we can get at the "about to be done" with the submitted side,
implement the ioctl, and we're all good, right?  Sadly, no.  There's
no guarantee about how long a "batch" takes.  So there's no universal
timeout the kernel can apply.  Also, if it does time out, the kernel
doesn't know who to blame for the timeout and how to prevent itself
from getting in trouble again.  The compositor does know who to blame
so, in theory, given the right ioctls, it could detect the -ETIME and
kill that client.  Not a great solution.
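
To make the two-timeline idea concrete, here's a rough sketch in plain
C11.  Every name in it (struct umf, umf_mark_submitted, and so on) is
made up for illustration; nothing like this exists in the kernel or in
any driver today.  It only shows the data layout and why "wait for
submit" becomes a cheap check against the submitted side.  It does
nothing about the timeout and blame problems described above.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical layout: one UMF exposed as two monotonic timelines. */
struct umf {
    _Atomic uint64_t submitted;  /* set at the start of the batch */
    _Atomic uint64_t completed;  /* set when the batch completes */
};

static void umf_init(struct umf *f)
{
    assert(f);
    atomic_init(&f->submitted, 0);
    atomic_init(&f->completed, 0);
}

/* Timelines only move forward; a stale value never rewinds them. */
static void umf_advance(_Atomic uint64_t *timeline, uint64_t value)
{
    uint64_t cur = atomic_load(timeline);

    while (cur < value &&
           !atomic_compare_exchange_weak(timeline, &cur, value))
        ; /* cur was reloaded by the failed exchange, so just retry */
}

static void umf_mark_submitted(struct umf *f, uint64_t value)
{
    umf_advance(&f->submitted, value);
}

static void umf_mark_completed(struct umf *f, uint64_t value)
{
    umf_advance(&f->completed, value);
}

/* "Wait for submit" turns into a check against the submitted side. */
static bool umf_is_submitted(struct umf *f, uint64_t value)
{
    return atomic_load(&f->submitted) >= value;
}

static bool umf_is_completed(struct umf *f, uint64_t value)
{
    return atomic_load(&f->completed) >= value;
}
```

Note what's missing: nothing bounds the gap between umf_mark_submitted
and umf_mark_completed, which is exactly the "no universal timeout"
problem.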

The best option I've been able to come up with for this is some sort
of client-provided signal.  Something where it says, as part of submit
or somewhere else, "I promise I'll be done soon" where that promise
comes with dire consequences if it's not.  At that point, we can turn
the UMF and a particular wait value into a one-shot fence like a
dma_fence or sync_file, or signal a syncobj on it.  If it ever times
out, we kick their context.  In Vulkan terminology, they get
VK_ERROR_DEVICE_LOST.  There are two important bits here:  First,
it's based on a client-provided thing.  With a full timeline model
and wait-before-signal, we can't infer when something is about to be
done.  Only the client knows when it has submitted its last node in
the dependency graph and the whole mess is unblocked.  Second, the
dma_fence is created within the client's driver context.  If it's
created compositor-side, the kernel doesn't know who to blame if
things go badly.  If we create it in the client, it's pretty easy to
make context death on -ETIME part of the contract.

(Before danvet jumps in here and rants about how UMF -> dma_fence
isn't possible, I haven't forgotten.  I'm pretending, for now, that
we've solved some of those problems.)
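
Still pretending, the "I promise I'll be done soon" contract might
look something like the sketch below.  This is a userspace simulation
in plain C, not kernel code: struct client_ctx, struct promised_fence,
promise_fence() and promised_fence_poll() are all hypothetical names,
and the "lost" flag just stands in for whatever would make the client
see VK_ERROR_DEVICE_LOST.  The two properties it tries to capture are
that the fence is created from the client's own context and that a
timeout is charged to that context.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

enum fence_state { FENCE_PENDING, FENCE_SIGNALED, FENCE_TIMED_OUT };

/* Hypothetical stand-in for the client's driver context. */
struct client_ctx {
    bool lost;  /* would surface to the client as device loss */
};

/* Hypothetical one-shot fence built from a UMF and a wait value. */
struct promised_fence {
    struct client_ctx *owner;  /* who gets blamed on timeout */
    _Atomic uint64_t *umf;     /* memory the client's work writes */
    uint64_t wait_value;
    uint64_t deadline_ns;
    enum fence_state state;
};

/* Client side: promise that *umf reaches wait_value within budget_ns. */
static struct promised_fence
promise_fence(struct client_ctx *ctx, _Atomic uint64_t *umf,
              uint64_t wait_value, uint64_t now_ns, uint64_t budget_ns)
{
    assert(ctx && umf);
    return (struct promised_fence){
        .owner = ctx,
        .umf = umf,
        .wait_value = wait_value,
        .deadline_ns = now_ns + budget_ns,
        .state = FENCE_PENDING,
    };
}

/* One-shot: once signaled or timed out, the state never changes. */
static enum fence_state
promised_fence_poll(struct promised_fence *f, uint64_t now_ns)
{
    if (f->state != FENCE_PENDING)
        return f->state;

    if (atomic_load(f->umf) >= f->wait_value) {
        f->state = FENCE_SIGNALED;
    } else if (now_ns >= f->deadline_ns) {
        f->state = FENCE_TIMED_OUT;
        f->owner->lost = true;  /* context death is part of the contract */
    }
    return f->state;
}
```

Because the owner is recorded when the client makes the promise, the
timeout path never has to guess who to blame.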

Another option is to just stall on the UMF until it's done.  Yeah,
kind-of terrible and high-latency, but it always works and doesn't
involve any complex logic to kill clients.  If a client never gets
around to signaling a fence, it just never repaints.  The compositor
keeps going like nothing's wrong.  Maybe, if the client submits lots
of frames without ever triggering, it'll hit some max queue depth
somewhere and get killed, but that's it.  More likely, the client's
vkAcquireNextImage will start timing out and it'll crash.
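
Here's a rough sketch of that stall path from the compositor's side,
again as a plain C simulation.  struct surface, struct pending_frame,
MAX_QUEUE_DEPTH and both functions are invented for illustration and
don't correspond to any real compositor.  The compositor only latches
a new buffer once its UMF has reached the wait value; until then it
keeps repainting with whatever it already has.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define MAX_QUEUE_DEPTH 4  /* arbitrary limit, for illustration only */

/* Hypothetical: a client frame that is ready once *umf >= wait_value. */
struct pending_frame {
    _Atomic uint64_t *umf;
    uint64_t wait_value;
    int buffer_id;
};

/* Hypothetical per-surface compositor state with a small FIFO. */
struct surface {
    int current_buffer;
    struct pending_frame queue[MAX_QUEUE_DEPTH];
    unsigned head;
    unsigned count;
};

/* Returns false when the queue is full; the caller decides what to do. */
static bool surface_queue_frame(struct surface *s, struct pending_frame f)
{
    assert(s && f.umf);
    if (s->count == MAX_QUEUE_DEPTH)
        return false;
    s->queue[(s->head + s->count) % MAX_QUEUE_DEPTH] = f;
    s->count++;
    return true;
}

/* Never blocks: latch frames in order while their fences are done. */
static int surface_repaint(struct surface *s)
{
    while (s->count > 0) {
        struct pending_frame *f = &s->queue[s->head];

        if (atomic_load(f->umf) < f->wait_value)
            break;  /* not done yet, keep showing the old buffer */
        s->current_buffer = f->buffer_id;
        s->head = (s->head + 1) % MAX_QUEUE_DEPTH;
        s->count--;
    }
    return s->current_buffer;
}
```

A client that never signals just stops updating, and the only failure
the compositor ever sees is surface_queue_frame() returning false.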

I suspect where we might actually land is some combination of the two
depending on client choice.  If the client wants to be dumb, it gets
the high-latency always-works path.  If the client really wants
lowest-latency VRR, it has to take the smarter path and risk
VK_ERROR_DEVICE_LOST if it misses too far.

But the point of all of this is that neither of the above two paths
has anything to do with the compositor calling a "wait for submit"
ioctl.  Building a design around that and baking it into protocol is,
IMO, a mistake.  I don't see any valid way to handle this mess without
"wait for submit" either not existing or existing only client-side for
the purposes of WSI.

--Jason