Am 10.08.2018 um 10:29 schrieb Daniel Vetter:
> [SNIP]
> I'm only interested in the case of shared buffers. And for those you
> _do_ pessimistically assume that all access must be implicitly synced.
> Iirc amdgpu doesn't support EGL_ANDROID_native_fence_sync, so this
> makes sense that you don't bother with it.

See flag AMDGPU_GEM_CREATE_EXPLICIT_SYNC.

>>> - as a consequence, amdgpu needs to pessimistically assume that all
>>> writes to shared buffers need to obey implicit fencing rules.
>>> - for shared buffers (across processes or drivers) implicit fencing does
>>> _not_ allow concurrent writers. That limitation is why people want to
>>> do explicit fencing, and it's the reason why there's only 1 slot for
>>> an exclusive fence. Note I really mean concurrent here, a queue of
>>> in-flight writes by different batches is perfectly fine. But it's a
>>> fully ordered queue of writes.
>>> - but as a consequence of amdgpu's lack of implicit fencing and hence
>>> the need to pessimistically assume there's multiple write fences, amdgpu
>>> needs to put multiple fences behind the single exclusive slot. This is
>>> a limitation imposed by the amdgpu stack, not something inherent to
>>> how implicit fencing works.
>>> - Chris Wilson's patch implements all this (and afaics with a bit more
>>> coffee, correctly).
>>>
>>> If you want to be less pessimistic in amdgpu for shared buffers, you
>>> need to start tracking which shared buffer accesses need implicit and
>>> which explicit sync. What you can't do is suddenly create more than 1
>>> exclusive fence, that's not how implicit fencing works. Another thing
>>> you cannot do is force everyone else (in non-amdgpu or core code) to
>>> sync against _all_ writes, because that forces implicit syncing. Which
>>> people very much don't want.
>>
>> I also do see the problem that most other hardware doesn't need that
>> functionality, because it is driven by a single engine. That's why I
>> tried to keep the overhead as low as possible.
>>
>> But at least for amdgpu (and I strongly suspect for nouveau as well) it
>> is absolutely vital in a number of cases to allow concurrent accesses
>> from the same client even when the BO is then later used with implicit
>> synchronization.
>>
>> This is also the reason why the current workaround is so problematic
>> for us. Because as soon as the BO is shared with another (non-amdgpu)
>> device, all command submissions to it will be serialized even when they
>> come from the same client.
>>
>> Would it be an option to extend the concept of the "owner" of the BO
>> amdgpu uses to other drivers as well?
>>
>> When you already have explicit synchronization inside your client, but
>> not between clients (e.g. still using DRI2 or DRI3), this could also be
>> rather beneficial for others as well.
>
> Again: How you synchronize your driver-internal rendering is totally
> up to you. If you see an exclusive fence by amdgpu, and submit new
> rendering by amdgpu, you can totally ignore the exclusive fence. The
> only API contracts for implicit fencing are between drivers for shared
> buffers. If you submit rendering to a shared buffer in parallel, all
> without attaching an exclusive fence, that's perfectly fine, but
> somewhen later on, depending upon the protocol (glFlush or
> glXSwapBuffers or whatever), you have to collect all those concurrent
> write hazards and bake them into 1 single exclusive fence for implicit
> fencing.
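(As a purely illustrative aside, a rough sketch of what "collect the
concurrent write hazards and bake them into one exclusive fence" could
look like with the generic reservation_object / dma_fence_array API as
of ~4.18; the helper name is made up and this is not actual amdgpu
code:)

#include <linux/dma-fence-array.h>
#include <linux/reservation.h>
#include <linux/slab.h>

/* Collapse all shared (write) fences of a reservation object into one
 * exclusive fence at hand-off time. Caller holds the reservation lock. */
static int bake_shared_writes_into_excl(struct reservation_object *resv)
{
	struct reservation_object_list *list;
	struct dma_fence_array *array;
	struct dma_fence **fences;
	unsigned int i, count;

	list = reservation_object_get_list(resv);
	count = list ? list->shared_count : 0;
	if (!count)
		return 0;

	fences = kmalloc_array(count, sizeof(*fences), GFP_KERNEL);
	if (!fences)
		return -ENOMEM;

	for (i = 0; i < count; ++i)
		fences[i] = dma_fence_get(rcu_dereference_protected(
				list->shared[i],
				reservation_object_held(resv)));

	/* Signals once all collected writes have signaled; on success the
	 * array takes ownership of the fences array and the references. */
	array = dma_fence_array_create(count, fences,
				       dma_fence_context_alloc(1), 1, false);
	if (!array) {
		while (i--)
			dma_fence_put(fences[i]);
		kfree(fences);
		return -ENOMEM;
	}

	/* Note: installing the exclusive fence drops the shared slots. */
	reservation_object_add_excl_fence(resv, &array->base);
	dma_fence_put(&array->base);
	return 0;
}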
> Atm (and Chris seems to concur) the amdgpu uapi doesn't allow you to
> do that, so for anything shared you have to be super pessimistic.
> Adding a HAND_OFF_FOR_IMPLICIT_FENCING flag/ioctl would probably fix
> that. Only when that flag is set would you take all shared write
> hazards and bake them into one exclusive fence for hand-off to the
> next driver. You'd also need the same when receiving an implicitly
> fenced buffer, to make sure that your concurrent writes do synchronize
> with reading (aka shared fences) done by other drivers. With a bunch
> of trickery and hacks it might be possible to infer this from current
> ioctls even, but you need to be really careful.

A new uapi is out of the question because we need to stay backward
compatible.

> And you're right that amdgpu seems to be the only (or one of the only)
> drivers which do truly concurrent rendering to the same buffer (not
> just concurrent rendering to multiple buffers all suballocated from
> the same bo). But we can't fix this in the kernel with the tricks you
> propose, because without such an uapi extension (or inference) we
> can't tell the implicit fencing from the explicit fencing case.

Sure we can. As I said, for amdgpu that is absolutely no problem at all.
In your terminology, all rendering from the same client to a BO is
explicitly fenced, while all rendering from different clients is
implicitly fenced.

> And for shared buffers with explicit fencing we _must_ _not_ sync
> against all writes. owner won't help here, because it's still not
> tracking whether something is explicitly or implicitly synced.

Implicit syncing can be disabled by giving the
AMDGPU_GEM_CREATE_EXPLICIT_SYNC flag while creating the BO.

> We've cheated a bit with most other drivers in this area, also because
> we don't have to deal with truly concurrent rendering.

Yeah, absolutely nailed it. And this cheating is now completely breaking
my neck, because it doesn't work well at all with the requirements I
have at hand here.

> So it's not obvious that we're not tracking writes/reads, but
> implicit/explicit fencing. But semantically we track the latter for
> shared buffers.
>
> Cheers, Daniel
>
> PS: One idea I have for inference: Every time you see a shared buffer
> in an amdgpu CS:
> 1. Grab the reservation lock.
> 2. Check all the fences' creators. If any of them are foreign (not by
> amdgpu), then run the current pessimistic code.

That is exactly what we already do.

> 3. If all fences are by amdgpu:
> - Look at the exclusive fence. If it's a ttm bo move, keep it; if it's
> marked as a special implicit syncing fence, ignore it.
> - Run all CS concurrently by storing all their write fences in the
> shared slots.
> - Create a fake exclusive fence which ties all the write hazards into
> one fence. Mark them as special implicit syncing fences in your
> amdgpu_fence struct. This will make sure other drivers sync properly,
> but since you ignore these special ones it won't reduce amdgpu-internal
> concurrency.

That won't work: adding the exclusive fence removes all shared fences,
and I still need to have those for the sync check above. I need
something like a callback from other drivers that the reservation
object is now used by them and no longer by amdgpu.

Regards,
Christian.

> - Make sure you don't drop the ttm bo move fences accidentally, will
> be some tricky accounting.
> 4. Submit the CS and drop the reservation lock.
>
> I think this would work, but it would be much cleaner if you make this
> an explicit part of the amdgpu uapi.
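(Again purely as an illustration, a rough sketch of what step 2 of that
plan, the "check all the fences' creators" part, could look like on top
of the generic reservation_object API; "my_driver_fence_ops" is a
made-up stand-in, the real amdgpu check keys off the fence owner
instead, and the caller is assumed to hold the reservation lock:)

#include <linux/dma-fence.h>
#include <linux/reservation.h>

extern const struct dma_fence_ops my_driver_fence_ops; /* hypothetical */

/* Return true if any fence on the reservation object was not created by
 * this driver, in which case the pessimistic sync path should be used. */
static bool resv_has_foreign_fence(struct reservation_object *resv)
{
	struct reservation_object_list *list;
	struct dma_fence *excl;
	unsigned int i;

	excl = reservation_object_get_excl(resv);
	if (excl && excl->ops != &my_driver_fence_ops)
		return true;

	list = reservation_object_get_list(resv);
	for (i = 0; list && i < list->shared_count; ++i) {
		struct dma_fence *f = rcu_dereference_protected(
			list->shared[i], reservation_object_held(resv));

		if (f->ops != &my_driver_fence_ops)
			return true;
	}

	return false;
}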