Am 10.08.2018 um 10:29 schrieb Daniel Vetter:
> [SNIP]
> I'm only interested in the case of shared buffers. And for those you
> _do_ pessimistically assume that all access must be implicitly synced.
> Iirc amdgpu doesn't support EGL_ANDROID_native_fence_sync, so this
> makes sense that you don't bother with it.

See flag AMDGPU_GEM_CREATE_EXPLICIT_SYNC.

>>> - as a consequence, amdgpu needs to pessimistically assume that all
>>> writes to shared buffers need to obey implicit fencing rules.
>>> - for shared buffers (across processes or drivers) implicit fencing does
>>> _not_ allow concurrent writers. That limitation is why people want to
>>> do explicit fencing, and it's the reason why there's only 1 slot for
>>> an exclusive fence. Note I really mean concurrent here, a queue of
>>> in-flight writes by different batches is perfectly fine. But it's a
>>> fully ordered queue of writes.
>>> - but as a consequence of amdgpu's lack of implicit fencing and hence
>>> the need to pessimistically assume there's multiple write fences, amdgpu
>>> needs to put multiple fences behind the single exclusive slot. This is
>>> a limitation imposed by the amdgpu stack, not something inherent to
>>> how implicit fencing works.
>>> - Chris Wilson's patch implements all this (and afaics with a bit more
>>> coffee, correctly).
>>>
>>> If you want to be less pessimistic in amdgpu for shared buffers, you
>>> need to start tracking which shared buffer accesses need implicit and
>>> which explicit sync. What you can't do is suddenly create more than 1
>>> exclusive fence, that's not how implicit fencing works. Another thing
>>> you cannot do is force everyone else (in non-amdgpu or core code) to
>>> sync against _all_ writes, because that forces implicit syncing. Which
>>> people very much don't want.
>>
>> I also do see the problem that most other hardware doesn't need that
>> functionality, because it is driven by a single engine. That's why I
>> tried to keep the overhead as low as possible.
>>
>> But at least for amdgpu (and I strongly suspect for nouveau as well) it
>> is absolutely vital in a number of cases to allow concurrent accesses
>> from the same client even when the BO is then later used with implicit
>> synchronization.
>>
>> This is also the reason why the current workaround is so problematic
>> for us. Because as soon as the BO is shared with another (non-amdgpu)
>> device, all command submissions to it will be serialized even when they
>> come from the same client.
>>
>> Would it be an option to extend the concept of the "owner" of the BO
>> amdgpu uses to other drivers as well?
>>
>> When you already have explicit synchronization inside your client, but
>> not between clients (e.g. still using DRI2 or DRI3), this could also be
>> rather beneficial for others as well.
>
> Again: How you synchronize your driver-internal rendering is totally
> up to you. If you see an exclusive fence by amdgpu, and submit new
> rendering by amdgpu, you can totally ignore the exclusive fence. The
> only API contracts for implicit fencing are between drivers for shared
> buffers. If you submit rendering to a shared buffer in parallel, all
> without attaching an exclusive fence, that's perfectly fine, but
> somewhen later on, depending upon the protocol (glFlush or
> glXSwapBuffers or whatever), you have to collect all those concurrent
> write hazards and bake them into 1 single exclusive fence for implicit
> fencing.
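(As a purely illustrative aside, a rough sketch of what "collect the
concurrent write hazards and bake them into one exclusive fence" could
look like with the generic reservation_object / dma_fence_array API as
of ~4.18; the helper name is made up and this is not actual amdgpu
code:)

#include <linux/dma-fence-array.h>
#include <linux/reservation.h>
#include <linux/slab.h>

/* Collapse all shared (write) fences of a reservation object into one
 * exclusive fence at hand-off time. Caller holds the reservation lock. */
static int bake_shared_writes_into_excl(struct reservation_object *resv)
{
	struct reservation_object_list *list;
	struct dma_fence_array *array;
	struct dma_fence **fences;
	unsigned int i, count;

	list = reservation_object_get_list(resv);
	count = list ? list->shared_count : 0;
	if (!count)
		return 0;

	fences = kmalloc_array(count, sizeof(*fences), GFP_KERNEL);
	if (!fences)
		return -ENOMEM;

	for (i = 0; i < count; ++i)
		fences[i] = dma_fence_get(rcu_dereference_protected(
				list->shared[i],
				reservation_object_held(resv)));

	/* Signals once all collected writes have signaled; on success the
	 * array takes ownership of the fences array and the references. */
	array = dma_fence_array_create(count, fences,
				       dma_fence_context_alloc(1), 1, false);
	if (!array) {
		while (i--)
			dma_fence_put(fences[i]);
		kfree(fences);
		return -ENOMEM;
	}

	/* Note: installing the exclusive fence drops the shared slots. */
	reservation_object_add_excl_fence(resv, &array->base);
	dma_fence_put(&array->base);
	return 0;
}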
> Atm (and Chris seems to concur) the amdgpu uapi doesn't allow you to
> do that, so for anything shared you have to be super pessimistic.
> Adding a HAND_OFF_FOR_IMPLICIT_FENCING flag/ioctl would probably fix
> that. Only when that flag is set would you take all shared write
> hazards and bake them into one exclusive fence for hand-off to the
> next driver. You'd also need the same when receiving an implicitly
> fenced buffer, to make sure that your concurrent writes do synchronize
> with reading (aka shared fences) done by other drivers. With a bunch
> of trickery and hacks it might be possible to infer this from current
> ioctls even, but you need to be really careful.

A new uapi is out of the question because we need to stay backward
compatible.

> And you're right that amdgpu seems to be the only (or one of the only)
> drivers which do truly concurrent rendering to the same buffer (not
> just concurrent rendering to multiple buffers all suballocated from
> the same bo). But we can't fix this in the kernel with the tricks you
> propose, because without such an uapi extension (or inference) we
> can't tell the implicit fencing from the explicit fencing case.

Sure we can. As I said, for amdgpu that is absolutely no problem at all.
In your terminology, all rendering from the same client to a BO is
explicitly fenced, while all rendering from different clients is
implicitly fenced.

> And for shared buffers with explicit fencing we _must_ _not_ sync
> against all writes. owner won't help here, because it's still not
> tracking whether something is explicitly or implicitly synced.

Implicit syncing can be disabled by giving the
AMDGPU_GEM_CREATE_EXPLICIT_SYNC flag while creating the BO.

> We've cheated a bit with most other drivers in this area, also because
> we don't have to deal with truly concurrent rendering.

Yeah, absolutely nailed it. And this cheating is now completely breaking
my neck, because it doesn't work well at all with the requirements I
have at hand here.

> So it's not obvious that we're not tracking writes/reads, but
> implicit/explicit fencing. But semantically we track the latter for
> shared buffers.
>
> Cheers, Daniel
>
> PS: One idea I have for inference: Every time you see a shared buffer
> in an amdgpu CS:
> 1. Grab the reservation lock.
> 2. Check all the fences' creators. If any of them are foreign (not by
> amdgpu), then run the current pessimistic code.

That is exactly what we already do.

> 3. If all fences are by amdgpu:
> - Look at the exclusive fence. If it's a ttm bo move, keep it; if it's
> marked as a special implicit syncing fence, ignore it.
> - Run all CS concurrently by storing all their write fences in the
> shared slots.
> - Create a fake exclusive fence which ties all the write hazards into
> one fence. Mark them as special implicit syncing fences in your
> amdgpu_fence struct. This will make sure other drivers sync properly,
> but since you ignore these special ones it won't reduce amdgpu-internal
> concurrency.

That won't work: adding the exclusive fence removes all shared fences,
and I still need to have those for the sync check above. I need
something like a callback from other drivers that the reservation
object is now used by them and no longer by amdgpu.

Regards,
Christian.

> - Make sure you don't drop the ttm bo move fences accidentally, will
> be some tricky accounting.
> 4. Submit the CS and drop the reservation lock.
>
> I think this would work, but it would be much cleaner if you make this
> an explicit part of the amdgpu uapi.
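(Again purely as an illustration, a rough sketch of what step 2 of that
plan, the "check all the fences' creators" part, could look like on top
of the generic reservation_object API; "my_driver_fence_ops" is a
made-up stand-in, the real amdgpu check keys off the fence owner
instead, and the caller is assumed to hold the reservation lock:)

#include <linux/dma-fence.h>
#include <linux/reservation.h>

extern const struct dma_fence_ops my_driver_fence_ops; /* hypothetical */

/* Return true if any fence on the reservation object was not created by
 * this driver, in which case the pessimistic sync path should be used. */
static bool resv_has_foreign_fence(struct reservation_object *resv)
{
	struct reservation_object_list *list;
	struct dma_fence *excl;
	unsigned int i;

	excl = reservation_object_get_excl(resv);
	if (excl && excl->ops != &my_driver_fence_ops)
		return true;

	list = reservation_object_get_list(resv);
	for (i = 0; list && i < list->shared_count; ++i) {
		struct dma_fence *f = rcu_dereference_protected(
			list->shared[i], reservation_object_held(resv));

		if (f->ops != &my_driver_fence_ops)
			return true;
	}

	return false;
}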