Re: [RFC] Explicit synchronization for Nouveau

James Jones <jajones@xxxxxxxxxx> · Mon, 29 Sep 2014 10:20:44 -0700

On 9/29/14 8:42 AM, Jerome Glisse wrote:
On Mon, Sep 29, 2014 at 09:43:02AM +0200, Daniel Vetter wrote:
On Fri, Sep 26, 2014 at 01:00:05PM +0300, Lauri Peltonen wrote:

Hi guys,

I'd like to start a new thread about explicit fence synchronization.  This time
with a Nouveau twist. :-)

First, let me define what I understand by implicit/explicit sync:

Implicit synchronization
* Fences are attached to buffers
* Kernel manages fences automatically based on buffer read/write access

Explicit synchronization
* Fences are passed around independently
* Kernel takes and emits fences to/from user space when submitting work
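
To make the two definitions concrete, here is a toy user-space model in Python. It is only a sketch and not kernel code: Fence, Buffer, submit_implicit and submit_explicit are hypothetical names invented for illustration, and real drivers track considerably more state.

```python
from dataclasses import dataclass, field
from itertools import count
from typing import List, Optional

_next_id = count(1)  # hypothetical global fence id source


@dataclass
class Fence:
    """Toy fence: stands for 'the job that created me has finished'."""
    id: int = field(default_factory=lambda: next(_next_id))


@dataclass
class Buffer:
    name: str
    # Only the implicit model stores fences on the buffer itself.
    write_fence: Optional[Fence] = None
    read_fences: List[Fence] = field(default_factory=list)


def submit_implicit(reads, writes):
    """Implicit sync: dependencies are derived from buffer access."""
    deps = []
    for buf in reads + writes:
        if buf.write_fence is not None:
            deps.append(buf.write_fence)   # everyone waits for the last writer
    for buf in writes:
        deps.extend(buf.read_fences)       # writers also wait for readers
    out = Fence()
    for buf in reads:
        buf.read_fences.append(out)
    for buf in writes:
        buf.write_fence = out
        buf.read_fences = []
    return deps, out


def submit_explicit(in_fences):
    """Explicit sync: the caller passes fences in and gets one back."""
    return list(in_fences), Fence()
```

In this model, two jobs that touch the same Buffer through submit_implicit always end up ordered, whereas submit_explicit only waits for the fences the caller chose to pass in.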

Implicit synchronization is already implemented in open source drivers, and
works well for most use cases.  I don't seek to change any of that.  My
proposal aims at allowing some drm drivers to operate in explicit sync mode to
get maximal performance, while still remaining fully compatible with the
implicit paradigm.

Yeah, pretty much what we have in mind on the i915 side too. I didn't look
too closely at your patches, so just a few high level comments on your rfc
here.

I will try to explain why I think we should support the explicit model as well.

1. Bindless graphics

Bindless graphics is a central concept when trying to reduce the OpenGL driver
overhead.  The idea is that the application can bind a large set of buffers to
the working set up front using extensions such as GL_ARB_bindless_texture, and
they remain resident until the application releases them (note that compute
APIs typically have similar semantics).  These working sets can be huge,
hundreds or even thousands of buffers, so we would like to opt out from the
per-submit overhead of acquiring locks, waiting for fences, and storing fences.
Automatically synchronizing these working sets in kernel will also prevent
parallelism between channels that are sharing the working set (in fact sharing
just one buffer from the working set will cause the jobs of the two channels to
be serialized).

2. Evolution of graphics APIs

Graphics APIs seem to be evolving in a direction where game engine
and middleware vendors demand more control over work submission and
synchronization.  We expect that this trend will continue, and more and more
synchronization decisions will be pushed to the API level.  OpenGL and EGL
already provide good explicit command stream level synchronization primitives:
glFenceSync and EGL_KHR_wait_sync.  Their use is also encouraged - for example
EGL_KHR_image_base spec clearly states that the application is responsible for
synchronizing accesses to EGLImages.  If the API that is exposed to developers
gives the control over synchronization to the developer, then implicit waits
that are inserted by the kernel are unnecessary and unexpected, and can
severely hurt performance.  It also makes it easy for the developer to write
code that happens to work on Linux because of implicit sync, but will fail on
other platforms.

3. Suballocation

Using user space suballocation can help reduce the overhead when a large number
of small textures are used.  Synchronizing suballocated surfaces implicitly in
kernel doesn't make sense - many channels should be able to access the same
kernel-level buffer object simultaneously.

4. Buffer sharing complications

This is not really an argument for explicit sync as such, but I'd like to point
out that sharing buffers across SoC engines is often much more complex than
just exporting and importing a dma-buf and waiting for the dma-buf fences.
Sometimes we need to do color format or tiling layout conversion.  Sometimes,
at least on Tegra, we need to decompress buffers when we pass them from the GPU
to an engine that doesn't support framebuffer compression.  These things are
not uncommon, particularly when we have SoCs that combine licensed IP blocks
from different vendors.  My point is that user space is already heavily
involved when sharing buffers between drivers, and giving it some more control
over synchronization is not adding that much complexity.

Because of the above arguments, I think it makes sense to let some user space
drm drivers opt out from implicit synchronization, while allowing them to still
remain fully compatible with the rest of the drm world that uses implicit
synchronization.  In practice, this would require three things:

(1) Support passing fences (that are not tied to buffer objects) between kernel
     and user space.

(2) Stop automatically storing fences to the buffers that user space wants to
     synchronize explicitly.

The problem with this approach is that you then need hw faulting to make
sure the memory is there. Implicit fences aren't just used for syncing,
but also to make sure that the gpu still has access to the buffer as long
as it needs it. So you need at least a non-exclusive fence attached for
each command submission.

Of course on Android you don't have swap (would kill the puny mmc within
seconds) and you don't care for letting userspace pin most of memory for
gfx. So you'll get away with no fences at all. But for upstream I don't
see a good solution unfortunately. Ideas very much welcome.

Well, I am going to repeat myself, but yes, you can do explicit sync without
associating a fence (at least with no struct allocation): just associate a
unique number with each command stream and make it global, i.e. independent of
the different execution pipelines your hw has.

For non-scheduling GPUs (roughly today's generation), you keep buffers on an
lru and you know you cannot evict to swap any buffer that has an active id
(the hw has not yet written the sequence number back). You update the lru with
each command stream ioctl.
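
A minimal sketch of that bookkeeping, as a toy Python model rather than driver code. SeqTracker and its methods are hypothetical names, and the model assumes the hw writes sequence numbers back in submission order, which a real multi-pipeline setup would have to handle more carefully.

```python
from collections import OrderedDict


class SeqTracker:
    """Toy model: one global sequence number per command stream."""

    def __init__(self):
        self.next_seq = 1         # global, shared by all execution pipelines
        self.completed = 0        # last sequence number the hw wrote back
        self.lru = OrderedDict()  # buffer -> last seq that used it, oldest first

    def submit(self, buffers):
        """Command stream ioctl: tag the buffers and refresh the lru."""
        seq = self.next_seq
        self.next_seq += 1
        for buf in buffers:
            self.lru[buf] = seq
            self.lru.move_to_end(buf)
        return seq

    def hw_writeback(self, seq):
        """The hw finished the command stream with this sequence number."""
        self.completed = max(self.completed, seq)

    def evictable(self):
        """Buffers with no active id, in lru order (oldest first)."""
        return [buf for buf, seq in self.lru.items() if seq <= self.completed]
```

No fence object is allocated per buffer here: the only per-buffer state is an integer, and eviction is a comparison against the last written-back value.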

For binding to GPU GART you can do that as a preamble to the command stream
which most hardware (AMD, Intel, NVidia) should be able to do.

For VRAM you have several choices, depending on how you want to manage VRAM.
For instance, you might want to use it more like a cache and give each command
stream a preamble with a bunch of copies to VRAM, and possibly a bunch of
copies back to RAM afterwards. Or you can keep the current scheme, except that
buffer moves are now scheduled as a preamble to the command stream.

Additionally, I think the goal is to move to a model where some
higher-level object such as a working set, rather than individual
buffers, is assigned counters or sync primitives on a per-submission
basis.  Versioning off tags for individual buffers then moves to working
set modification time.  This is more feasible if the only thing that
needs precise fencing of individual surfaces is lifetime management.

The trend seems to be towards establishing a relatively large working
set up front and then submitting many command buffers against it,
perhaps with incremental modifications to the working set along the way.
This may be what's referred to as the Android model above, but I view
it as the "non-glitchy graphic" model going forward.
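
One possible reading of the working-set idea, as a toy Python sketch. This is an interpretation and not an existing interface: WorkingSet, its fields and its methods are all hypothetical. The set carries a single submission counter, modifications bump a version, and individual buffers are only tracked precisely for lifetime management when they leave the set.

```python
class WorkingSet:
    """Toy model: sync state lives on the set, not on each buffer."""

    def __init__(self):
        self.buffers = set()
        self.version = 0           # bumped on every modification of the set
        self.last_submit = 0       # one counter for the whole set
        self.pending_release = []  # (buffer, submit counter at removal time)

    def add(self, buf):
        self.buffers.add(buf)
        self.version += 1

    def remove(self, buf):
        self.buffers.discard(buf)
        self.version += 1
        # The buffer may still be in use by submissions up to last_submit.
        self.pending_release.append((buf, self.last_submit))

    def submit(self):
        """Submit a command buffer against the whole set."""
        self.last_submit += 1
        return self.last_submit, self.version

    def retire(self, completed):
        """Return buffers that are safe to free once 'completed' has finished."""
        freed = [b for b, s in self.pending_release if s <= completed]
        self.pending_release = [
            (b, s) for b, s in self.pending_release if s > completed
        ]
        return freed
```

The per-submit cost in this sketch does not grow with the size of the set; only adding or removing a buffer touches per-buffer state.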

Thanks,
-James

So I do not see what you would consider rocket science about this?

Cheers,
Jérôme

(3) Allow user space to attach an explicit fence to dma-buf when exporting to
     another driver that uses implicit sync.

There are still some open issues beyond these.  For example, can we skip
acquiring the ww mutex for explicitly synchronized buffers?  I think we could
eventually, at least on unified memory systems where we don't need to migrate
between heaps (our downstream Tegra GPU driver does not lock any buffers at
submit, it just grabs refcounts for hw).  Another quirk is that now Nouveau
waits on the buffer fences when closing the gem object to ensure that it
doesn't unmap too early.  We need to rework that for explicit sync, but that
shouldn't be difficult.

See above, but you can't avoid attaching fences as long as we still use a
buffer-object based gfx memory management model. At least afaics. Which
means you need the ordering guarantees imposed by ww mutexes to ensure
that the oddball implicit ordered client can't deadlock the kernel's
memory management code.

I have written a prototype that demonstrates (1) by adding explicit sync fd
support to Nouveau.  It's not a lot of code, because I only use a relatively
small subset of the android sync driver functionality.  Thanks to Maarten's
rewrite, all I need to do is to allow creating a sync_fence from a drm fence in
order to pass it to user space.  I don't need to use sync_pt or sync_timeline,
or fill in sync_timeline_ops.

I can see why the upstream has been reluctant to de-stage the android sync
driver in its current form, since (even though it now builds on struct fence)
it still duplicates some of the drm fence concepts.  I'd like to think that my
patches only use the parts of the android sync driver that genuinely are
missing from the drm fence model: allowing user space to operate on fence
objects that are independent of buffer objects.

Imo de-staging the android syncpt stuff needs to happen first, before
drivers can use it. Since non-staging stuff really shouldn't depend upon
code from staging.

The last two patches are mocks that show how (2) and (3) might work out.  I
haven't done any testing with them yet.  Before going any further, I'd like to
get your feedback.  Can you see the benefits of explicit sync as an alternative
synchronization model?  Do you think we could use the android sync_fence for
passing fences between user space?  Or did you have something else in mind for
explicit sync in the drm world?

I'm all for adding explicit syncing. Our plans are roughly:
- Add both an in and an out fence to execbuf to sync with other rendering
   and give userspace a fence back. Needs two different flags probably.

- Maybe add an ioctl to dma-bufs to get at the current implicit fences
   attached to them (both an exclusive and non-exclusive version). This
   should help with making explicit and implicit sync work together nicely.

- Add fence support to kms. Probably only worth it together with the new
   atomic stuff. Again we need an in fence to wait for (one for each
   buffer) and an out fence. The latter can easily be implemented by
   extending struct drm_event, which means not a single driver code line
   needs to be changed for this.

- For de-staging android syncpts we need to de-clutter the internal
   interfaces and also review all the ioctls exposed. Like you say it
   should be just the userspace interface for struct drm_fence. Also, it
   needs testcases and preferably manpages.

Unfortunately it looks like Intel won't do this all for you due to a bunch
of hilarious internal reasons :( At least not anytime soon.

Cheers, Daniel
--
Daniel Vetter
Software Engineer, Intel Corporation
+41 (0) 79 365 57 48 - http://blog.ffwll.ch
_______________________________________________
dri-devel mailing list
dri-devel@xxxxxxxxxxxxxxxxxxxxx
http://lists.freedesktop.org/mailman/listinfo/dri-devel