Re: RFC: Add write flag to reservation object fences

On 10.08.2018 at 10:29, Daniel Vetter wrote:
[SNIP]
I'm only interested in the case of shared buffers. And for those you
_do_ pessimistically assume that all access must be implicitly synced.
Iirc amdgpu doesn't support EGL_ANDROID_native_fence_sync, so it makes
sense that you don't bother with it.

See flag AMDGPU_GEM_CREATE_EXPLICIT_SYNC.
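
(For reference, a minimal, untested sketch of what setting that flag looks
like through the raw GEM create ioctl; field names are taken from the amdgpu
uapi header, and the include path may differ on your system:)

#include <string.h>
#include <sys/ioctl.h>
#include <drm/amdgpu_drm.h>	/* uapi header, path may vary */

/* Allocate a BO that opts out of implicit synchronization.
 * "fd" is an open amdgpu render node. Error handling trimmed. */
static int alloc_explicit_sync_bo(int fd, __u64 size, __u32 *handle)
{
	union drm_amdgpu_gem_create args;

	memset(&args, 0, sizeof(args));
	args.in.bo_size = size;
	args.in.alignment = 4096;
	args.in.domains = AMDGPU_GEM_DOMAIN_VRAM;
	/* This is the flag that disables implicit syncing for the BO. */
	args.in.domain_flags = AMDGPU_GEM_CREATE_EXPLICIT_SYNC;

	if (ioctl(fd, DRM_IOCTL_AMDGPU_GEM_CREATE, &args))
		return -1;

	*handle = args.out.handle;
	return 0;
}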


- as a consequence, amdgpu needs to pessimistically assume that all
writes to shared buffers need to obey implicit fencing rules.
- for shared buffers (across processes or drivers) implicit fencing does
_not_ allow concurrent writers. That limitation is why people want to
do explicit fencing, and it's the reason why there's only one slot for
an exclusive fence. Note I really mean concurrent here: a queue of in-flight
writes by different batches is perfectly fine, but it's a fully
ordered queue of writes.
- but as a consequence of amdgpu's lack of implicit fencing, and hence
the need to pessimistically assume there are multiple write fences, amdgpu
needs to put multiple fences behind the single exclusive slot. This is
a limitation imposed by the amdgpu stack, not something inherent to
how implicit fencing works.
- Chris Wilson's patch implements all this (and afaics with a bit more
coffee, correctly).

If you want to be less pessimistic in amdgpu for shared buffers, you
need to start tracking which shared buffer accesses need implicit and
which need explicit sync. What you can't do is suddenly create more than 1
exclusive fence, that's not how implicit fencing works. Another thing
you cannot do is force everyone else (in non-amdgpu or core code) to
sync against _all_ writes, because that forces implicit syncing. Which
people very much don't want.
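
(To make the single-exclusive-slot point concrete, here is a simplified,
hand-trimmed sketch of the reservation object this thread is about; it is
not a verbatim copy of include/linux/reservation.h, RCU annotations and
bookkeeping fields are left out:)

struct reservation_object_list {
	u32 shared_count;		/* number of shared (reader) fences */
	struct dma_fence *shared[];	/* shared slots, grows as needed */
};

struct reservation_object {
	struct ww_mutex lock;			/* serializes fence updates */
	struct dma_fence *fence_excl;		/* the ONE exclusive (writer) slot */
	struct reservation_object_list *fence;	/* any number of shared slots */
};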

I also do see the problem that most other hardware doesn't need that
functionality, because it is driven by a single engine. That's why I tried
to keep the overhead as low as possible.

But at least for amdgpu (and I strongly suspect for nouveau as well) it is
absolutely vital in a number of cases to allow concurrent accesses from the
same client even when the BO is then later used with implicit
synchronization.

This is also the reason why the current workaround is so problematic for us.
Because as soon as the BO is shared with another (non-amdgpu) device, all
command submissions to it are serialized, even when they come from the
same client.

Would it be an option to extend the concept of the "owner" of the BO that
amdgpu uses to other drivers as well?

When you already have explicit synchronization inside your client, but not
between clients (e.g. a client still using DRI2 or DRI3), this could be
rather beneficial for other drivers as well.
Again: how you synchronize your driver-internal rendering is totally
up to you. If you see an exclusive fence by amdgpu, and submit new
rendering by amdgpu, you can totally ignore the exclusive fence. The
only api contracts for implicit fencing are between drivers for shared
buffers. If you submit rendering to a shared buffer in parallel, all
without attaching an exclusive fence, that's perfectly fine; but
at some point later on, depending upon the protocol (glFlush or glXSwapBuffers
or whatever), you have to collect all those concurrent write hazards
and bake them into one single exclusive fence for implicit fencing.
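
(To illustrate that "collect and bake" step, a rough, untested kernel-side
sketch of how it could look with the current reservation_object and
dma_fence_array helpers; this is just the shape of the operation, not code
from any driver, and the context argument would come from
dma_fence_context_alloc():)

#include <linux/dma-fence-array.h>
#include <linux/reservation.h>
#include <linux/slab.h>

/*
 * Collapse all fences currently attached to a reservation object into one
 * fence and install it as the exclusive fence, so that other drivers doing
 * implicit fencing wait for every outstanding write. Caller holds obj->lock.
 * Note: the previous exclusive fence (e.g. a ttm bo move) is not folded in
 * here for brevity; real code would have to keep it.
 */
static int bake_exclusive_fence(struct reservation_object *obj, u64 context)
{
	struct dma_fence **fences;
	struct dma_fence *excl;
	struct dma_fence_array *array;
	unsigned int count;
	int ret;

	/* Snapshot the exclusive and all shared fences (takes references). */
	ret = reservation_object_get_fences_rcu(obj, &excl, &count, &fences);
	if (ret)
		return ret;

	if (count == 0) {
		/* No shared fences to collapse, nothing to do. */
		dma_fence_put(excl);
		return 0;
	}

	/* Tie all write hazards together into one container fence. */
	array = dma_fence_array_create(count, fences, context, 1, false);
	if (!array) {
		while (count--)
			dma_fence_put(fences[count]);
		kfree(fences);
		dma_fence_put(excl);
		return -ENOMEM;
	}

	/* Hand the combined fence off through the single exclusive slot. */
	reservation_object_add_excl_fence(obj, &array->base);
	dma_fence_put(&array->base);
	dma_fence_put(excl);
	return 0;
}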

Atm (and Chris seems to concur) the amdgpu uapi doesn't allow you to
do that, so for anything shared you have to be super pessimistic.
Adding a HAND_OFF_FOR_IMPLICIT_FENCING flag/ioctl would probably fix
that. Only when that flag is set would you take all shared write
hazards and bake them into one exclusive fence for hand-off to the
next driver. You'd also need the same when receiving an implicitly
fenced buffer, to make sure that your concurrent writes do synchronize
with reading (aka shared fences) done by other drivers. With a bunch
of trickery and hacks it might even be possible to infer this from the
current ioctls, but you need to be really careful.

A new uapi is out of the question because we need to be backward compatible.

And you're right that amdgpu seems to be the only (or one of the few)
drivers which do truly concurrent rendering to the same buffer (not
just concurrent rendering to multiple buffers all suballocated from
the same bo). But we can't fix this in the kernel with the tricks you
propose, because without such an uapi extension (or inference) we
can't tell the implicit fencing from the explicit fencing case.

Sure we can. As I said, for amdgpu that is absolutely no problem at all.

In your terminology, all rendering from the same client to a BO is explicitly fenced, while all rendering from different clients is implicitly fenced.

And for shared buffers with explicit fencing we _must_ _not_ sync against
all writes. The owner won't help here, because it still doesn't track
whether something is explicitly or implicitly synced.

Implicit syncing can be disabled by setting the AMDGPU_GEM_CREATE_EXPLICIT_SYNC flag when creating the BO.

We've cheated a bit with most other drivers in this area, also because
we don't have to deal with truly concurrent rendering.

Yeah, absolutely nailed it. And this cheating is now really coming back to bite me, because it doesn't work well at all with the requirements I have at hand here.

So it's not obvious that what we're tracking is not writes/reads, but
implicit/explicit fencing. But semantically we track the latter for shared
buffers.

Cheers, Daniel

PS: One idea I have for inference: Every time you see a shared buffer
in an amdgpu CS:
1. Grab reservation lock
2. Check all the fences' creators. If any of them are foreign (not by
amdgpu), then run the current pessimistic code.

That is exactly what we already do.

3. If all fences are by amdgpu
- Look at the exclusive fence. If it's a ttm bo move, keep it; if it's
marked as a special implicit-syncing fence, ignore it.
- Run all CS concurrently by storing all their write fences in the shared slots.
- Create a fake exclusive fence which ties all the write hazards into
one fence. Mark them as special implicit syncing fences in your
amdgpu_fence struct. This will make sure other drivers sync properly,
but since you ignore these special fences it won't reduce amdgpu-internal
concurrency.

That won't work: adding the exclusive fence removes all shared fences, and I still need those for the sync check above.

I need something like a callback from other drivers that the reservation object is now used by them and no longer by amdgpu.

Regards,
Christian.

- Make sure you don't accidentally drop the ttm bo move fences; that will
be some tricky accounting.
4. Submit the CS and drop the reservation lock.

I think this would work, but it would be much cleaner if you make this an
explicit part of the amdgpu uapi.
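
(For what it's worth, a rough sketch of the check in step 2, i.e. "are all
attached fences ours?". fence_is_ours() is a made-up placeholder for however
the driver recognizes its own fences, e.g. by comparing the fence ops:)

#include <linux/dma-fence.h>
#include <linux/reservation.h>
#include <linux/slab.h>

/* Hypothetical helper: decide whether a fence was created by this driver,
 * e.g. by comparing f->ops against the driver's dma_fence_ops. */
static bool fence_is_ours(struct dma_fence *f);

/*
 * Returns true if every fence attached to the reservation object was
 * created by this driver, i.e. the pessimistic implicit-sync path can
 * be skipped for this CS. Caller holds the reservation lock.
 */
static bool resv_only_has_our_fences(struct reservation_object *obj)
{
	struct dma_fence *excl, **shared;
	unsigned int i, count;
	bool ours = true;

	if (reservation_object_get_fences_rcu(obj, &excl, &count, &shared))
		return false;	/* on error, fall back to the pessimistic path */

	if (excl && !fence_is_ours(excl))
		ours = false;

	for (i = 0; i < count && ours; i++)
		if (!fence_is_ours(shared[i]))
			ours = false;

	for (i = 0; i < count; i++)
		dma_fence_put(shared[i]);
	kfree(shared);
	dma_fence_put(excl);

	return ours;
}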
