On Wed, Jun 02, 2021 at 08:52:38PM +0200, Christian König wrote:
>
>
> On 02.06.21 at 20:48, Daniel Vetter wrote:
> > On Wed, Jun 02, 2021 at 05:38:51AM -0400, Marek Olšák wrote:
> > > On Wed, Jun 2, 2021 at 5:34 AM Marek Olšák <maraeo@xxxxxxxxx> wrote:
> > >
> > > > Yes, we can't break anything because we don't want to complicate things
> > > > for us. It's pretty much all NAK'd already. We are trying to gather more
> > > > knowledge and then make better decisions.
> > > >
> > > > The idea we are considering is that we'll expose memory-based sync objects
> > > > to userspace for read only, and the kernel or hw will strictly control the
> > > > memory writes to those sync objects. The hole in that idea is that
> > > > userspace can decide not to signal a job, so even if userspace can't
> > > > overwrite memory-based sync object states arbitrarily, it can still decide
> > > > not to signal them, and then a future fence is born.
> > > >
> > > This would actually be treated as a GPU hang caused by that context, so it
> > > should be fine.
> > This is practically what I proposed already, except your not doing it with
> > dma_fence. And on the memory fence side this also doesn't actually give
> > what you want for that compute model.
> >
> > This seems like a bit a worst of both worlds approach to me? Tons of work
> > in the kernel to hide these not-dma_fence-but-almost, and still pain to
> > actually drive the hardware like it should be for compute or direct
> > display.
> >
> > Also maybe I've missed it, but I didn't see any replies to my suggestion
> > how to fake the entire dma_fence stuff on top of new hw. Would be
> > interesting to know what doesn't work there instead of amd folks going of
> > into internal again and then coming back with another rfc from out of
> > nowhere :-)
>
> Well to be honest I would just push back on our hardware/firmware guys that
> we need to keep kernel queues forever before going down that route.

I looked again, and you said the model won't work because preemption is way
too slow, even when the context is idle.

I guess at that point I got maybe too fed up and just figured "not my
problem", but if preempt is too slow as the unload fence, you can do it
with pte removal and tlb shootdown too (that is hopefully not too slow,
otherwise your hw is just garbage and won't even be fast for direct submit
compute workloads).

The only thing that you need to do when you use pte clearing + tlb
shootdown instead of preemption as the unload fence for buffers that get
moved is that if you get any gpu page fault, you don't serve that, but
instead treat it as a tdr and shoot the context permanently.

So summarizing the model I proposed:

- you allow userspace to directly write into the ringbuffer, and also
  write the fences directly

- actual submit is done by the kernel, using drm/scheduler. The kernel
  blindly trusts userspace to set up everything else, and even just wraps
  dma_fences around the userspace memory fences.

- the only check is tdr. If a fence doesn't complete and tdr fires, a) the
  kernel shoots the entire context and b) userspace recovers by setting up
  a new ringbuffer

- memory management is done using ttm only, you still need to supply the
  buffer list (ofc that list includes the always present ones, so CS will
  only get the list of special buffers like today). If your hw can't trun
  gpu page faults and you ever get one we pull up the same old solution:
  Kernel shoots the entire context.

  The important thing is that from the gpu pov memory management works
  exactly like compute workload with direct submit, except that you just
  terminate the context on _any_ page fault, instead of only those that go
  somewhere where there's really no mapping and repair the others.

  Also I guess from reading the old thread this means you'd disable page
  fault retry because that is apparently also way too slow for anything.

- memory management uses an unload fence.
  That unload fence waits for all userspace memory fences (represented as
  dma_fence) to complete, with maybe some fudge to busy-spin until we've
  reached the actual end of the ringbuffer (maybe you have an IB tail there
  after the memory fence write, we have that on intel hw), and it waits for
  the memory to get "unloaded". This is either preemption, or pte clearing
  + tlb shootdown, or whatever else your hw provides which is a) used for
  dynamic memory management b) fast enough for actual memory management.

- any time a context dies we force-complete all its pending fences,
  in-order ofc

So from hw pov this looks 99% like direct userspace submit, with the exact
same mappings, command sequences and everything else. The only difference
is that the ringbuffer head/tail updates happen from drm/scheduler, instead
of directly from userspace.

None of this stuff needs funny tricks where the kernel controls the writes
to memory fences, or where you need kernel ringbuffers, or anything like
this. Userspace is allowed to do anything stupid, the rules are guaranteed
with:

- we rely on the hw isolation features to work, but _exactly_ like compute
  direct submit would too

- dying on any page fault captures memory management issues

- dying (without kernel recovery, this is up to userspace if it cares) on
  any tdr makes sure fences complete still

> That syncfile and all that Android stuff isn't working out of the box with
> the new shiny user queue submission model (which in turn is mostly because
> of Windows) already raised some eyebrows here.

I think if you really want to make sure the current linux stack doesn't
break the _only_ option you have is to provide a ctx mode that allows
dma_fence and drm/scheduler to be used like today. For everything else it
sounds like you're a few years too late, because even just huge kernel
changes won't happen in time. Much less rewriting userspace protocols.
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch
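
The fence rules summarized in the mail above (tdr as the only check, dying on
any page fault, in-order force-completion, and the unload fence) can be
illustrated with a toy state machine. This is only a sketch in Python, not
kernel code: the Fence and Context classes, their fields and every method
name are hypothetical and do not correspond to dma_fence, drm/scheduler or
ttm interfaces, and time is passed in by hand instead of coming from a timer.

```python
from dataclasses import dataclass, field


@dataclass
class Fence:
    seqno: int
    deadline: float        # hypothetical: time after which tdr fires
    signaled: bool = False
    error: bool = False    # True if force-completed because the context died


@dataclass
class Context:
    pending: list = field(default_factory=list)  # fences in submission order
    dead: bool = False
    next_seqno: int = 1

    def submit(self, now, timeout):
        """Kernel-side submit: track a fence for the userspace memory fence."""
        if self.dead:
            raise RuntimeError("context was shot; userspace must set up a new one")
        fence = Fence(self.next_seqno, now + timeout)
        self.next_seqno += 1
        self.pending.append(fence)
        return fence

    def memory_fence_written(self, seqno):
        """The memory fence reached seqno: signal tracked fences in order."""
        while self.pending and self.pending[0].seqno <= seqno:
            self.pending.pop(0).signaled = True

    def kill(self):
        """Shoot the context and force-complete all pending fences, in order."""
        self.dead = True
        while self.pending:
            fence = self.pending.pop(0)
            fence.error = True
            fence.signaled = True

    def check_tdr(self, now):
        """If the oldest fence missed its deadline, the whole context dies."""
        if self.pending and now > self.pending[0].deadline:
            self.kill()

    def page_fault(self):
        """Any gpu page fault is treated like a tdr: no repair, context dies."""
        self.kill()

    def unload_fence_done(self, hw_unloaded):
        """Unload fence: all memory fences completed and the hw unload finished."""
        return not self.pending and hw_unloaded


ctx = Context()
a = ctx.submit(now=0.0, timeout=1.0)
b = ctx.submit(now=0.0, timeout=1.0)
ctx.memory_fence_written(1)   # job a completes normally
ctx.check_tdr(now=2.0)        # job b never signals, so tdr shoots the context
```

The point of the sketch is that a fence can never stay pending forever:
either userspace signals it, or tdr or a page fault kills the context and
force-completes everything still queued, which is what lets the unload fence
make progress.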