On Fri, Aug 10, 2018 at 11:14 AM, Christian König <christian.koenig at amd.com> wrote: > Am 10.08.2018 um 10:29 schrieb Daniel Vetter: >> >> [SNIP] >> I'm only interested in the case of shared buffers. And for those you >> _do_ pessimistically assume that all access must be implicitly synced. >> Iirc amdgpu doesn't support EGL_ANDROID_native_fence_sync, so this >> makes sense that you don't bother with it. > > > See flag AMDGPU_GEM_CREATE_EXPLICIT_SYNC. > > >> >>>> - as a consequence, amdgpu needs to pessimistically assume that all >>>> writes to shared buffer need to obey implicit fencing rules. >>>> - for shared buffers (across process or drivers) implicit fencing does >>>> _not_ allow concurrent writers. That limitation is why people want to >>>> do explicit fencing, and it's the reason why there's only 1 slot for >>>> an exclusive. Note I really mean concurrent here, a queue of in-flight >>>> writes by different batches is perfectly fine. But it's a fully >>>> ordered queue of writes. >>>> - but as a consequence of amdgpu's lack of implicit fencing and hence >>>> need to pessimistically assume there's multiple write fences amdgpu >>>> needs to put multiple fences behind the single exclusive slot. This is >>>> a limitation imposed by by the amdgpu stack, not something inherit to >>>> how implicit fencing works. >>>> - Chris Wilson's patch implements all this (and afaics with a bit more >>>> coffee, correctly). >>>> >>>> If you want to be less pessimistic in amdgpu for shared buffers, you >>>> need to start tracking which shared buffer access need implicit and >>>> which explicit sync. What you can't do is suddenly create more than 1 >>>> exclusive fence, that's not how implicit fencing works. Another thing >>>> you cannot do is force everyone else (in non-amdgpu or core code) to >>>> sync against _all_ writes, because that forces implicit syncing. Which >>>> people very much don't want. >>> >>> >>> I also do see the problem that most other hardware doesn't need that >>> functionality, because it is driven by a single engine. That's why I >>> tried >>> to keep the overhead as low as possible. >>> >>> But at least for amdgpu (and I strongly suspect for nouveau as well) it >>> is >>> absolutely vital in a number of cases to allow concurrent accesses from >>> the >>> same client even when the BO is then later used with implicit >>> synchronization. >>> >>> This is also the reason why the current workaround is so problematic for >>> us. >>> Cause as soon as the BO is shared with another (non-amdgpu) device all >>> command submissions to it will be serialized even when they come from the >>> same client. >>> >>> Would it be an option extend the concept of the "owner" of the BO amdpgu >>> uses to other drivers as well? >>> >>> When you already have explicit synchronization insider your client, but >>> not >>> between clients (e.g. still uses DRI2 or DRI3), this could also be rather >>> beneficial for others as well. >> >> Again: How you synchronize your driver internal rendering is totally >> up to you. If you see an exclusive fence by amdgpu, and submit new >> rendering by amdgpu, you can totally ignore the exclusive fence. The >> only api contracts for implicit fencing are between drivers for shared >> buffers. If you submit rendering to a shared buffer in parallel, all >> without attaching an exclusive fence that's perfectly fine, but >> somewhen later on, depending upon protocol (glFlush or glxSwapBuffers >> or whatever) you have to collect all those concurrent write hazards >> and bake them into 1 single exclusive fence for implicit fencing. >> >> Atm (and Chris seems to concur) the amdgpu uapi doesn't allow you to >> do that, so for anything shared you have to be super pessimistic. >> Adding a HAND_OFF_FOR_IMPLICIT_FENCING flag/ioctl would probably fix >> that. Only when that flag is set would you take all shared write >> hazards and bake them into one exclusive fence for hand-off to the >> next driver. You'd also need the same when receiving an implicitly >> fenced buffer, to make sure that your concurrent writes do synchronize >> with reading (aka shared fences) done by other drivers. With a bunch >> of trickery and hacks it might be possible to infer this from current >> ioctls even, but you need to be really careful. > > > A new uapi is out of question because we need to be backward compatible. Since when is new uapi out of the question for a performance improvement? >> And you're right that amdgpu seems to be the only (or one of the only) >> drivers which do truly concurrent rendering to the same buffer (not >> just concurrent rendering to multiple buffers all suballocated from >> the same bo). But we can't fix this in the kernel with the tricks you >> propose, because without such an uapi extension (or inference) we >> can't tell the implicit fencing from the explicit fencing case. > > > Sure we can. As I said for amdgpu that is absolutely no problem at all. > > In your terminology all rendering from the same client to a BO is explicitly > fenced, while all rendering from different clients are implicit fenced. > >> And for shared buffers with explicit fencing we _must_ _not_ sync against >> all writes. owner won't help here, because it's still not tracking >> whether something is explicit or implicit synced. > > > Implicit syncing can be disable by giving the > AMDGPU_GEM_CREATE_EXPLICIT_SYNC flag while creating the BO. > >> We've cheated a bit with most other drivers in this area, also because >> we don't have to deal with truly concurrent rendering. > > > Yeah, absolutely nailed it. And this cheating is now completely breaking my > neck because it doesn't work well at all with the requirements I have at > hand here. > >> So it's not >> obvious that we're not tracking writes/reads, but implicit/explicit >> fencing. But semantically we track the later for shared buffers. >> >> Cheers, Daniel >> >> PS: One idea I have for inference: Every time you see a shared buffer >> in an amdgpu CS: >> 1. Grab reservation lock >> 2. Check all the fences' creators. If any of them are foreign (not by >> amdgpu), then run the current pessimistic code. > > > That is exactly what we already do. > >> 3. If all fences are by amdgpu >> - Look at the exclusive fence. If it's a ttm bo move keep it, if it's >> marked as a special implicit syncing fence, ignore it. >> - Run all CS concurrently by storing all their write fences in the shared >> slots. >> - Create a fake exclusive fence which ties all the write hazards into >> one fence. Mark them as special implicit syncing fences in your >> amdgpu_fence struct. This will make sure other drivers sync properly, >> but since you ignore these special it won't reduce amdgpu-internal >> concurrency. > > > That won't work, adding the exclusive fence removes all shared fences and I > still need to have those for the sync check above. > > I need something like a callback from other drivers that the reservation > object is now used by them and no longer by amdgpu. Then don't track _any_ of the amdgpu internal fences in the reservation object: - 1 reservation object that you hand to ttm, for use internally within amdgpu - 1 reservation object that you attach to the dma-buf (or get from the imported dma-buf), where you play all the tricks to fake fences. -Daniel -- Daniel Vetter Software Engineer, Intel Corporation +41 (0) 79 365 57 48 - http://blog.ffwll.ch