Re: [PATCH 4/4] RFC: dma-buf: Add an API for importing sync files (v6)

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On 5/24/21 8:11 PM, Jason Ekstrand wrote:
On Sat, May 22, 2021 at 3:05 PM Daniel Stone <daniel@xxxxxxxxxxxxx> wrote:

Hi,

On Thu, 20 May 2021 at 20:00, Jason Ekstrand <jason@xxxxxxxxxxxxxx> wrote:
In the Vulkan world, this is useful for dealing with the out-fence from
vkQueuePresent.  Current Linux window-systems (X11, Wayland, etc.) all
rely on dma-buf implicit sync.  Since Vulkan is an explicit sync API, we
get a set of fences (VkSemaphores) in vkQueuePresent and have to stash
those as an exclusive (write) fence on the dma-buf.  We handle it in
Mesa today with the above mentioned dummy submit trick.  This ioctl
would allow us to set it directly without the dummy submit.

This may also open up possibilities for GPU drivers to move away from
implicit sync for their kernel driver uAPI and instead provide sync
files and rely on dma-buf import/export for communicating with other
implicit sync clients.

We make the explicit choice here to only allow setting RW fences which
translates to an exclusive fence on the dma_resv.  There's no use for
read-only fences for communicating with other implicit sync userspace
and any such attempts are likely to be racy at best.

I think I almost follow, but I'm not quite seeing the race you allude
to. Let me talk through my understanding so it's hopefully more clear
for others as a summary. Please correct me on anything I've
misunderstood or just missed completely. (I thought when I wrote this
intro that the email might be relatively snappy, but it's really long
and takes in a lot of breadth. Sorry.)

So as far as I'm reading this patchset and the Mesa MR, this _only_
concerns the out-fence (i.e. compositor -> client AcquireNextImage
semaphore/fence) so far. The in-fence (client->compositor QueuePresent
wait-semaphores/fences) is handled by the driver ensuring that an
exclusive resv is placed on the union of the signal semaphores passed
to QueuePresent, either through flags on its CS ioctl, or amdgpu's BO
flags, or ... either way, no problem as it should always be exclusive,
and it seems pretty uncontroversial that we should pull the wait
semaphore into an exclusive fence so that no downstream consumer will
begin using it until the client ops have fully retired.

For AcquireNextImage, your patchset exports all the fences (shared and
exclusive both) on the dma_resv out into the semaphore/fence such that
the client can't progress (CPU-side for VkFence, GPU-side for
VkSemaphore) until all currently queued operations have completely
retired. So, as long as the server ensures that all its kernel-side
work is flushed before its IPC to unblock ANI (wl_buffer.release or
DRI3 PresentIdle, both indicating that the client is free to reuse the
buffer, subject to synchronising against implicit fences on the resv),
all good: we snapshot the current resv state, wrap the relevant
driver-side Vulkan primitive around it, and go back to explicit
synchronisation. The client can't race generating new work against
this, because it can't queue any work until ANI has returned and
queued a signaling op on the semaphore/fence.

So far, so good. I really like both your series up until this
narrative point; as I was saying in the userspace-fence thread, the
WSI<->client thread is certainly pulling a very big lever with a
heavyweight transition between the two different worlds, and I like
that the new export ioctl lets us be very clear about what exactly is
happening under the hood. Good stuff.

Glad to hear.  If you turned that into RBs on the first three patches
in this series and all but the last patch on the Mesa MR, it'd make me
even happier. :-D

At this point, I think everyone is pretty happy with the first three
patches and the export ioctl.  In the Vulkan WSI code, it solves a
significant over-synchronization issue for ANV.  Also, the uAPI
shouldn't be controversial at all because it's identical to poll()
except that it gives you a FD you can poll on later to get the result
of the poll as it would be now.  I think if we get some Mesa reviews,
we should be able to land those.  It's import that's trickier.

So, what gives with the import ioctl? Taking a guess at where you're
going, the import ioctl is going to be used in QueuePresent just as
the export ioctl is in ANI: instead of having CS flags to write into
the resv exclusive slot or per-BO flags to always dump in exclusive,
there's a forthcoming patch somewhere which lets drivers skip this and
instead have common QueuePresent code dump the wait semaphore into the
resv, so servers on the other side of an implicit-only protocol will
synchronise their access against the fence imported as part of
client-side QueuePresent?

Correct.  And the patch isn't forthcoming.  It already exists as the
top patch in the Mesa MR (!4037).

That makes sense to me and is nicely symmetrical, plus it gets GPU
drivers further out of the business of doing magic winsys/display
synchronisation, which is good. But why enforce that imported fences
have to go into the exclusive slot? It makes sense from the point of
view of clients doing QueueSubmit, but I think it might preclude other
uses for no particularly good reason.

Mostly, I was trying to limit the scope a bit.  Import is tricky and,
to be honest, I'm still not 100% sure that it's safe.  I probably
should have left RFC on this patch.

As long as we keep everything inside the kernel, there's a requirement
that any work which triggers any fence on a dma_resv waits on the
exclusive fence, if any.  Work which sets the exclusive fence has to
wait on all fences.  With the import ioctl, we can guarantee that the
fences fire in the right order by making an array fence containing the
new fence and all other fences and using that as the exclusive fence.
What we can't do, however, is ensure that the work which triggers the
fence actually blocks on ANY of the fences on the dma_resv.

Hrm... I think I've now convinced myself that importing shared fences
is no more dangerous than importing an exclusive fence because they
share the same set of problems.  Unfortunately, I've unconvinced
myself of the safety of either.  I've got to think some more....

The most convincing argument I can make to myself is that this ioctl
is like having a one-shot little queue that contains tiny little work
jobs which wait on whatever sync_file you provide, do nothing, and
then signal.  That should be safe, right?

Certainly on X11 there's no real use for the shared slot - you get
into difficulties with mixed client/server rendering - but at least
Wayland and GBM are always non-destructive, so conceptually have a
good use for being able to import fences into the shared resv. This
goes for media and KMS access as well, if you think beyond the usecase
of an explicit Vulkan client sending through to a dumb implicit server
which just wants to texture.

Media clients are definitely a relevant usecase, both directly with
V4L2/VA-API, neither of which have support for explicit
synchronisation yet and (at least for V4L2) are unlikely to do so in
the near future, but even more with pipeline frameworks like GStreamer
and PipeWire, which don't have fundamental awareness (they're prepared
to deal with deep pipelines which return at arbitrary times, but once
anything is returned, it is available for immediate use). Maybe
open-coding CPU-side waits in these users is the best way longer term,
but for easier interop if nothing else, being able to pull shared
fences in seems like it could be a win ('the compositor is still
texturing from this for now, so feel free to read back from ref
frames, but don't scribble over it until it's finished').

I'm slightly bemused that there's so much energy spent on designing
around winsys being dumb and implicit.

I'd like to address this one as it's a comment you've made several
times.  Once you've fixed raw X11 (not just XWayland) and a new
release has been made (hah!) and is shipping in distros with said
support, then we can talk.  Sorry if that comes off as overly snarky
but that's reality that we (driver devs) are living with.  To make
things even worse, when we're in Vulkan land (as opposed to GL), we
can't tell up-front whether or not our window-system supports foo
fences and adjust accordingly.  We have to start up, begin rendering,
and only later figure out "oops, this one goes to X11".  We really
can't say things like "when running on modern Wayland, do things the
new way" because Vulkan doesn't have a concept of "running on" a
window system.

FWIW, I do have a Mesa MR somewhere which adds explicit sync for
Vulkan+Wayland when the wayland protocols are there.  I don't remember
why it didn't land.  Maybe lack of acquire fence support?  I think the
protocol issues have been fixed, so we should up-rev the requirements
as needed and land it.

We did take a long time to roll
out sync_file support, but that was only because we didn't quite
understand all the nuances of why implicit sync is so difficult for
GPU drivers on modern architectures and how it was actually a win
compared to doing nothing; maybe we should have some kind of
conference where we all get together and talk to each other ... ?
Anyway, by the time we got to the cusp of rolling out bi-directional
sync_file support, suddenly we had drm_syncobj. By the time that had
semi-settled down and started to be rolled out, then we suddenly had
userspace fences on the horizon. And what do we do with those?

You're not wrong....  And that's the second reason for the gymnastics
above.  Ever since Vulkan launched, we've been fumbling around trying
to figure out how to best do synchronization.  'm reasonably certain
that userspace memory fences are the future but I'm much less certain
about the path to get there.  It's been a process of patient
experimentation so far and I think we're getting closer.  Syncobj,
timeline syncobj, etc. have all been steps along that path.  I've been
hesitant to ask the window-system people to draft piles of extensions
until we're sure we've found "the one".  It's bad enough iterating in
kernel-space and userspace without bringing Wayland protocol into each
iteration step.  For that reason, one of my goals along this process
has been to try and decouple things as much as we can.

So, in summary, none of this is because I think that window systems
are dumb and implicit or for any lack of respect for the people that
work on them.  I assume that X11 will always be dumb and implicit.
(I'd love to be proven wrong!)  For Wayland (and XWayland, probably),
I assume the people are very smart and active and will implement
whatever APIs we (the driver people) need as long as they're
reasonable.  I just don't know what to ask for yet.

We've - certainly Weston, and I'm pretty confident I speak for Simon
on the wlroots side and Jonas for Mutter - landed on accepting that
we're going to have to deal with timeline semaphores and
wait-before-signal; we have a credible plan (we think) for protocol to
deal with at least syncobj, which should conceptually extend to
userspace fences. The biggest blocker there is the syncobj uAPI.
Compositors aren't Vulkan clients - we don't have one thread per group
of work, because the inter-client synchronisation becomes nightmarish
overhead for little gain. I don't think this will ever change, because
the balance of work is totally different between client and
compositor.

Anyway, the problem with syncobj is that the ioctl to wait for a
sync_file to materialise for a given timeline point only allows us to
block with a timeout; this is a non-starter, because we need something
which fits into epoll. The most optimal case is returning a new
eventfd or similar which signals when a given timeline point becomes
available or signaled, but in extremis a syncobj FD becoming readable
when any activity which would change the result of any zero-timeout
wait on that syncobj is more or less workable.

Right.  You could call this an oversight.  Really, though, it was
because the use-cases we were interested in were ones where a wait
with a timeout was perfectly acceptable and where the extra overhead
of setting an epoll with ioctls wasn't.  It shouldn't be too hard to
wire up if that helps (cross your fingers).  But....

If we go the direction of userspace memory fences (which I think is
likely), there is no such thing as "wait for the fence to
materialize".  The work is either done or it isn't.  There is no
enforceable concept of "about to be done".  The word "enforceable" is
important there.  We could add such a concept but it'd be dependent on
the userspace client (not driver) to let us know when it's just about
ready and then we'd have to VK_ERROR_DEVICE_LOST on them or similar if
they break the contract.  Maybe possible but there's some design work
required there.  The other option is to make the compositors handle
this new way of thinking more thoroughly and eat the latency hit.

What we do want to do though, regardless of the primitive or its
semantics, is to only have to go through this once, not rework it all
in six months, and have to support a bunch of diverging codepaths. So
what is the optimal synchronisation primitive we should be aiming for
on the winsys side? Is sync_file a good enough lowest common
denominator, or should we do timeline-syncobj for fancy clients, in
tandem with sync_file bolted on the side for media and ancient GPUs?
Or are userspace fences going to give us some new primitive? And if
so, can we please talk about those semantics now, so we don't end up
with three synchronisation mechanisms which are all sort of but not
really alike?

As I said above, I think we're getting close but I don't think we're there yet.

As far as I can tell, the three relevant vendors with (more or less)
upstream drivers here are AMD, Arm, and Intel. Arm is obviously a bit
more up in the air as the hardware and specs aren't currently
available to the upstream development teams, but hopefully we can
bring them into this conversation. I think it's looking like we're
optimising for all of the following, without leaving anyone out in the
cold:

1. Modern userspace-fencing hardware running on a
userspace-fencing-aware winsys, i.e. new AMD/Arm/Intel on one of the
big three Wayland compositors which handle enough synchronisation
logic internally that the choice of underlying
composition/presentation API is immaterial (for which any new API is
important)
2. Modern userspace-fencing hardware running on Xwayland, for which
we'll inevitably have to type out new DRI3 synchronsiation, but
derived purely from the Wayland protocols so it can be passed through
quite directly, and with any mixed X11<->user buffer client
interaction (e.g. subwindows) being penalised by a long blocking wait
in the X server
3. sync_file-oriented hardware, for which we need to do ~nothing
4. Modern userspace-fencing hardware and APIs which need to interop
with media units, for all four combinations of client/server
source/sink; for some APIs like Vulkan Video synchronisation is not a
problem, for others like VA-API/V4L2/GStreamer we're probably need
going to live with the implicit semantics for the foreseeable future,
which means we should do the right thing for them up front, because
why would you even be playing games if you're not streaming them to
Twitch at the same time? (Personally I'm much happier that no-one
other than my friends who already know I'm terrible can see me playing
CoD, but y'know, kids these days ...)

The fifth case I've left out is clients who want to smash Vulkan
content directly into the display. For games and Kodi-like UI I'm just
going to ignore this, because (maybe I'm biased, but) using KMS
directly is difficult enough that you shouldn't do it and just use a
winsys because we've spent a very long time dealing with these
problems for you. The remaining usecase there is XR, but their uses
are very different, and OpenXR already deals with a _lot_ of this for
you, with the runtimes having sufficiently sophisticated
synchronisation internally that whatever we come up with won't be
onerous for them to implement. Either way, integration with KMS/GBM
has the same problem as Wayland, in that unless you fully encapsulate
it in something like VK_KHR_display, you don't get to throw a thread
under the bus to do delayed submit, because you need to actually
return something to the caller.

You're missing a use-case:  Modern userspace-fencing hardware running
on bare X11.  I don't know that we should optimize for this case, so
to speak, but it has to work non-terribly.  Also, as I alluded to
above, I really don't want to be maintaining two drivers forever: One
for X11 and ancient Wayland and one for modern Wayland.  We need to be
able to modernize the driver and still support the old window-systems.
Unless you can promise me that X11 and KDE/Wayland will either go away
or else get modernized, I need a fall-back plan.  And even if you make

Hi,

I don't know why you specifically mentioned KDE/Wayland, but we are on board with explicit sync. :)

Cheers,
Vlad

me said promise, I'm not going to believe you. :-P  8-9 years ago, I
was one of the people making those promises.  Now I'm writing X11
winsys code for drivers.  And... now I'm re-thinking some of my life
choices....

Ultimately, I think I'm leaning towards agreeing with Christian that I
would like to see a good holistic model for how this works in a
variety of usecases before we progress new uAPI, but also to agreeing
with you and Dan in that how amdgpu currently implements things is
currently objectively very wrong (despite the motivations for having
implemented it that way).

If there are any usecases I've missed then I'm all ears, but else I
think we should take this forward by working with
Vulkan/EGL/Wayland/X11/media/KMS and coming up with realistic
skeletons for end-to-end synchronisation through those usecases. It's
clear that this never happened for syncobj, because it's just not
usable as a primitive in any compositor which exists today: let's make
sure we don't get into the same trap again. If we can do this over
email/GitLab then that's great, but if not, maybe we need to do a kind
of mini-XDC with some kind of virtual whiteboard we can all scribble
over.

I think what we're looking at is roughly three steps, the first two of
which are mostly there on the Wayland side:

  1. Implicit sync.  We know what this one is.  glFlush/vkQueueSubmit
on the one side, start texturing on the other, and it all works.

  2. Immutable SW fences, i.e. sync_file.  This is where we have a
fence object that gets returned from the flush/submit and can be
passed into the texture op to provide a dependency.  Syncobj may be
useful here but only as a container.  If a sync_file is a dma_fence*,
a syncobj should be thought of as a dma_fence**.  This may be useful
if we want to retrofit sync_file into X11 where the current DRI3
explicit sync stuff is based on creating an object and then re-using
it for every present.

  3. Userspace memory fences.

Note that timeline syncobj is NOT in that list.  IMO, all the "wait
for submit" stuff is an implementation detail we needed in order to
get the timeline semantics on top of immutable SW fences.  Under the
hood it's all dma_fence; this just gives us a shareable container so
we can implement VK_KHR_timeline_semaphore with sharing.  I really
don't want to make Wayland protocol around it if memory fences are the
final solution.


(As a coda to this, I'm pretty sure I understand the subtleties of the
memory fences for relocation/shootdown, but it would be _really_ good
to have a coherent description everyone agrees on written down
somewhere, so people can understand the issues without having to read
250 emails with people at loggerheads.)

Let me give it a try.  I define a userspace memory fence as follows:
  - The fence object is fundamentally a bit of mappable gpu-accessible
memory which stores a 64-bit counter value which is used as a
timeline.  (On Windows, they require it to live in system memory.)
  - For sharable memory fences, each one probably has to go in its own page. :-(
  - The value is initialized to 0.
  - To signal the fence, someone (GPU or CPU) writes a new 64-bit value.
  - Waits on the fence are waits for it to be >= some value.

Depending on circumstances, the wait may be a 32-bit comparison and
may be >= with wrap-around.  For the purposes of window-system
protocol, I think we can assume 64-bit >= with no wrap-around.

There are a few very important points worth noting about them, however:
  - No one except the userspace client (not driver!) has any idea
when/if a value will be signaled
  - The value may never be signaled
  - Someone may submit a wait before someone submits a signal; we have
to deal with that
  - There is no concept of "about to be signaled"
  - For the kernel/firmware GPU scheduler handling waits, this means it
just keeps trying to run work and hopes the wait eventually unblocks.
It also means we need to totally decouple kernel synchronization for
memory management purposes from synchronization for work scheduling.
  - For a compositor, I'm still not 100% sure what this means.  TBD, I think.

There are some ways to work around some of these issues.  Windows has
a few tricks which we might be able to steal if we want.

Why would anyone want such a horrid thing?  Application developers
absolutely love them.  They can write massively multi-threaded apps
with piles of work queues that require very little cross-thread
synchronization and the GPU scheduler sorts it all out for them in the
end.  If you're a 3D game engine developer, this timeline model is
very powerful.  If you're a driver or window-system developer, you
really have to just embrace the lack of knowledge and trust apps.

Oof... That got long.  I hope it was informative.

--Jason


Cheers,
Daniel








When we got to
insert the RW fence, the actual fence we set as the new exclusive fence
is a combination of the sync_file provided by the user and all the other
fences on the dma_resv.  This ensures that the newly added exclusive
fence will never signal before the old one would have and ensures that
we don't break any dma_resv contracts.  We require userspace to specify
RW in the flags for symmetry with the export ioctl and in case we ever
want to support read fences in the future.

There is one downside here that's worth documenting:  If two clients
writing to the same dma-buf using this API race with each other, their
actions on the dma-buf may happen in parallel or in an undefined order.
Both with and without this API, the pattern is the same:  Collect all
the fences on dma-buf, submit work which depends on said fences, and
then set a new exclusive (write) fence on the dma-buf which depends on
said work.  The difference is that, when it's all handled by the GPU
driver's submit ioctl, the three operations happen atomically under the
dma_resv lock.  If two userspace submits race, one will happen before
the other.  You aren't guaranteed which but you are guaranteed that
they're strictly ordered.  If userspace manages the fences itself, then
these three operations happen separately and the two render operations
may happen genuinely in parallel or get interleaved.  However, this is a
case of userspace racing with itself.  As long as we ensure userspace
can't back the kernel into a corner, it should be fine.

v2 (Jason Ekstrand):
  - Use a wrapper dma_fence_array of all fences including the new one
    when importing an exclusive fence.

v3 (Jason Ekstrand):
  - Lock around setting shared fences as well as exclusive
  - Mark SIGNAL_SYNC_FILE as a read-write ioctl.
  - Initialize ret to 0 in dma_buf_wait_sync_file

v4 (Jason Ekstrand):
  - Use the new dma_resv_get_singleton helper

v5 (Jason Ekstrand):
  - Rename the IOCTLs to import/export rather than wait/signal
  - Drop the WRITE flag and always get/set the exclusive fence

v5 (Jason Ekstrand):
  - Split import and export into separate patches
  - New commit message

Signed-off-by: Jason Ekstrand <jason@xxxxxxxxxxxxxx>
---
  drivers/dma-buf/dma-buf.c    | 32 ++++++++++++++++++++++++++++++++
  include/uapi/linux/dma-buf.h |  1 +
  2 files changed, 33 insertions(+)

diff --git a/drivers/dma-buf/dma-buf.c b/drivers/dma-buf/dma-buf.c
index 7679562b57bfc..c9d6b9195c00c 100644
--- a/drivers/dma-buf/dma-buf.c
+++ b/drivers/dma-buf/dma-buf.c
@@ -417,6 +417,36 @@ static long dma_buf_export_sync_file(struct dma_buf *dmabuf,
         put_unused_fd(fd);
         return ret;
  }
+
+static long dma_buf_import_sync_file(struct dma_buf *dmabuf,
+                                    const void __user *user_data)
+{
+       struct dma_buf_sync_file arg;
+       struct dma_fence *fence, *singleton = NULL;
+       int ret = 0;
+
+       if (copy_from_user(&arg, user_data, sizeof(arg)))
+               return -EFAULT;
+
+       if (arg.flags != DMA_BUF_SYNC_RW)
+               return -EINVAL;
+
+       fence = sync_file_get_fence(arg.fd);
+       if (!fence)
+               return -EINVAL;
+
+       dma_resv_lock(dmabuf->resv, NULL);
+
+       ret = dma_resv_get_singleton_rcu(dmabuf->resv, fence, &singleton);
+       if (!ret && singleton)
+               dma_resv_add_excl_fence(dmabuf->resv, singleton);
+
+       dma_resv_unlock(dmabuf->resv);
+
+       dma_fence_put(fence);
+
+       return ret;
+}
  #endif

  static long dma_buf_ioctl(struct file *file,
@@ -465,6 +495,8 @@ static long dma_buf_ioctl(struct file *file,
  #if IS_ENABLED(CONFIG_SYNC_FILE)
         case DMA_BUF_IOCTL_EXPORT_SYNC_FILE:
                 return dma_buf_export_sync_file(dmabuf, (void __user *)arg);
+       case DMA_BUF_IOCTL_IMPORT_SYNC_FILE:
+               return dma_buf_import_sync_file(dmabuf, (const void __user *)arg);
  #endif

         default:
diff --git a/include/uapi/linux/dma-buf.h b/include/uapi/linux/dma-buf.h
index f902cadcbdb56..75fdde4800267 100644
--- a/include/uapi/linux/dma-buf.h
+++ b/include/uapi/linux/dma-buf.h
@@ -70,5 +70,6 @@ struct dma_buf_sync_file {
  #define DMA_BUF_SET_NAME_A     _IOW(DMA_BUF_BASE, 1, u32)
  #define DMA_BUF_SET_NAME_B     _IOW(DMA_BUF_BASE, 1, u64)
  #define DMA_BUF_IOCTL_EXPORT_SYNC_FILE _IOWR(DMA_BUF_BASE, 2, struct dma_buf_sync_file)
+#define DMA_BUF_IOCTL_IMPORT_SYNC_FILE _IOW(DMA_BUF_BASE, 3, struct dma_buf_sync)

  #endif
--
2.31.1





[Index of Archives]     [Linux DRI Users]     [Linux Intel Graphics]     [Linux USB Devel]     [Video for Linux]     [Linux Audio Users]     [Yosemite News]     [Linux Kernel]     [Linux SCSI]     [XFree86]     [Linux USB Devel]     [Video for Linux]     [Linux Audio Users]     [Linux Kernel]     [Linux SCSI]     [XFree86]
  Powered by Linux