One thing I've forgotten, since it's only hinted at here: If/when we switch
TLB flushing from the current dumb & synchronous implementation we now have
in upstream i915 to one with batching using dma_fence, then I think that
should also be done with a small helper library of shared code. The batching
is somewhat tricky, you need to make sure you put the fence into the right
dma_resv_usage slot, and the trick of replacing the vm fence with a TLB flush
fence is also a good reason to share the code so we only have it once.
Christian's recent work already has some prep for this with the
fence-replacing trick.
-Daniel

On Thu, 31 Mar 2022 at 10:28, Daniel Vetter <daniel@xxxxxxxx> wrote:
> Adding a pile of people who've expressed interest in vm_bind for their
> drivers.
>
> Also note to the intel folks: This is largely written with me having my
> subsystem co-maintainer hat on, i.e. what I think is the right thing to do
> here for the subsystem at large. There is substantial rework involved
> here, but it's not any different from i915 adopting ttm or i915 adopting
> drm/sched, and I do think this stuff needs to happen in one form or
> another.
>
> On Mon, Mar 07, 2022 at 12:31:45PM -0800, Niranjana Vishwanathapura wrote:
> > VM_BIND design document with description of intended use cases.
> >
> > Signed-off-by: Niranjana Vishwanathapura <niranjana.vishwanathapura@xxxxxxxxx>
> > ---
> >  Documentation/gpu/rfc/i915_vm_bind.rst | 210 +++++++++++++++++++++++++
> >  Documentation/gpu/rfc/index.rst        |   4 +
> >  2 files changed, 214 insertions(+)
> >  create mode 100644 Documentation/gpu/rfc/i915_vm_bind.rst
> >
> > diff --git a/Documentation/gpu/rfc/i915_vm_bind.rst b/Documentation/gpu/rfc/i915_vm_bind.rst
> > new file mode 100644
> > index 000000000000..cdc6bb25b942
> > --- /dev/null
> > +++ b/Documentation/gpu/rfc/i915_vm_bind.rst
> > @@ -0,0 +1,210 @@
> > +==========================================
> > +I915 VM_BIND feature design and use cases
> > +==========================================
> > +
> > +VM_BIND feature
> > +================
> > +DRM_I915_GEM_VM_BIND/UNBIND ioctls allow UMD to bind/unbind GEM buffer
> > +objects (BOs) or sections of a BO at specified GPU virtual addresses on
> > +a specified address space (VM).
> > +
> > +These mappings (also referred to as persistent mappings) will be persistent
> > +across multiple GPU submissions (execbuff) issued by the UMD, without user
> > +having to provide a list of all required mappings during each submission
> > +(as required by older execbuff mode).
> > +
> > +The VM_BIND ioctl defers binding the mappings until the next execbuff
> > +submission where it will be required, or binds immediately if the
> > +I915_GEM_VM_BIND_IMMEDIATE flag is set (useful if mapping is required for
> > +an active context).
>
> So this is a screw-up I've done, and for upstream I think we need to fix
> it: Implicit sync is bad, and it's also still a bad idea for vm_bind, and
> I was wrong suggesting we should do this a few years back when we kicked
> this off internally :-(
>
> What I think we need is just always VM_BIND_IMMEDIATE mode, and then a few
> things on top:
> - in and out fences, like with execbuf, to allow userspace to sync with
>   execbuf as needed
>   - for compute-mode contexts this means userspace memory fences
>   - for legacy contexts this means a timeline syncobj in drm_syncobj
>
> No sync_file or anything else like this at all. This means a bunch of
> work, but also it'll have benefits because it means we should be able to
> use exactly the same code paths and logic for both compute and for legacy
> contexts, because drm_syncobj supports future fence semantics.
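>
> To make the fence part a bit more concrete, this is roughly the shape of
> ioctl I have in mind for the legacy context case. To be clear, this is a
> sketch only, all the struct and field names below are invented and not
> actual uapi, the real thing needs to go through proper uapi review:
>
> 	#include <linux/types.h>
>
> 	struct drm_i915_gem_vm_bind {
> 		__u64 extensions;	/* chain of i915_user_extension */
> 		__u32 vm_id;		/* VM to bind into */
> 		__u32 handle;		/* GEM BO handle */
> 		__u64 start;		/* GPU virtual address of the mapping */
> 		__u64 offset;		/* offset into the BO (partial binding) */
> 		__u64 length;		/* length of the mapping */
> 		__u64 flags;
>
> 		/*
> 		 * Explicit sync only: timeline syncobj handle + point to wait
> 		 * on before the bind executes, and one to signal once the
> 		 * bind has completed. Compute-mode contexts would use
> 		 * userspace memory fences here instead.
> 		 */
> 		__u32 in_syncobj;
> 		__u32 out_syncobj;
> 		__u64 in_point;
> 		__u64 out_point;
> 	};
>
> The point being that drm_syncobj and userspace memory fences are the only
> sync primitives in here, nothing else.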
>
> Also on the implementation side we still need to install dma_fence to the
> various dma_resv, and for this we need the new dma_resv_usage series from
> Christian König first. vm_bind fences can then use the USAGE_BOOKKEEPING
> flag to make sure they never result in an oversync issue with execbuf. I
> don't think trying to land vm_bind without that prep work in
> dma_resv_usage makes sense.
>
> Also as soon as dma_resv_usage has landed there's a few cleanups we should
> do in i915:
> - ttm bo moving code should probably simplify a bit (and maybe more of the
>   code should be pushed as helpers into ttm)
> - clflush code should be moved over to using USAGE_KERNEL and the various
>   hacks and special cases should be ditched. See df94fd05e69e ("drm/i915:
>   expand on the kernel-doc for cache_dirty") for a bit more context
>
> This is still not yet enough, since if a vm_bind races with an eviction we
> might stall on the new buffers being readied first before the context can
> continue. This needs some care to make sure that vma which aren't fully
> bound yet are on a separate list, and vma which are marked for unbinding
> are removed from the main working set list as soon as possible.
>
> All of these things are relevant for the uapi semantics, which means
> - they need to be documented in the uapi kerneldoc, ideally with example
>   flows
> - umds need to ack this
>
> The other thing here is the async/nonblocking path. I think we still need
> that one, but again it should not sync with anything going on in execbuf,
> but simply execute the ioctl code in a kernel thread. The idea here is
> that this works like a special gpu engine, so that compute and vk can
> schedule bindings interleaved with rendering. This should be enough to get
> a performant vk sparse binding/textures implementation.
>
> But I'm not entirely sure on this one, so this definitely needs acks from
> umds.
>
> > +VM_BIND feature is advertised to user via I915_PARAM_HAS_VM_BIND.
> > +User has to opt-in for VM_BIND mode of binding for an address space (VM)
> > +during VM creation time via the I915_VM_CREATE_FLAGS_USE_VM_BIND extension.
> > +A VM in VM_BIND mode will not support the older execbuff mode of binding.
> > +
> > +UMDs can still send BOs of these persistent mappings in the execlist of
> > +execbuff for specifying BO dependencies (implicit fencing) and to use a BO
> > +as a batch, but those BOs should be mapped ahead via the vm_bind ioctl.
>
> should or must?
>
> Also I'm not really sure that's a great interface. The batchbuffer really
> only needs to be an address, so maybe all we need is an extension to
> supply a u64 batchbuffer address instead of trying to retrofit this into
> an unfitting current uapi.
>
> And for implicit sync there's two things:
> - for vk I think the right uapi is the dma-buf fence import/export ioctls
>   from Jason Ekstrand. I think we should land that first instead of
>   hacking funny concepts together
> - for gl the dma-buf import/export might not be fast enough, since gl
>   needs to do a _lot_ of implicit sync. There we might need to use the
>   execbuffer buffer list, but then we should have extremely clear uapi
>   rules which disallow _everything_ except setting the explicit sync uapi
>
> Again all this stuff needs to be documented in detail in the kerneldoc
> uapi spec.
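>
> Going back to the batchbuffer point above, what I mean is something along
> these lines as an execbuf extension. Sketch only again, the extension name
> and layout here are made up, not actual uapi:
>
> 	#include <linux/types.h>
> 	#include <drm/i915_drm.h>	/* struct i915_user_extension */
>
> 	/*
> 	 * In VM_BIND mode the batch does not need to be an entry in the
> 	 * buffer object list at all, userspace just hands execbuf the GPU
> 	 * virtual address of the already vm_bound batch directly.
> 	 */
> 	struct drm_i915_gem_execbuffer_ext_batch_address {
> 		struct i915_user_extension base;
> 		__u64 batch_address;	/* GPU VA of the batch in the context's VM */
> 	};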
>
> > +VM_BIND features include,
> > +- Multiple Virtual Address (VA) mappings can map to the same physical pages
> > +  of an object (aliasing).
> > +- VA mapping can map to a partial section of the BO (partial binding).
> > +- Support capture of persistent mappings in the dump upon GPU error.
> > +- TLB is flushed upon unbind completion. Batching of TLB flushes in some
> > +  usecases will be helpful.
> > +- Asynchronous vm_bind and vm_unbind support.
> > +- VM_BIND uses user/memory fence mechanism for signaling bind completion
> > +  and for signaling batch completion in long running contexts (explained
> > +  below).
>
> This should all be in the kerneldoc.
>
> > +VM_PRIVATE objects
> > +------------------
> > +By default, BOs can be mapped on multiple VMs and can also be dma-buf
> > +exported. Hence these BOs are referred to as Shared BOs.
> > +During each execbuff submission, the request fence must be added to the
> > +dma-resv fence list of all shared BOs mapped on the VM.
> > +
> > +VM_BIND feature introduces an optimization where user can create a BO which
> > +is private to a specified VM via the I915_GEM_CREATE_EXT_VM_PRIVATE flag
> > +during BO creation. Unlike Shared BOs, these VM private BOs can only be
> > +mapped on the VM they are private to and can't be dma-buf exported.
> > +All private BOs of a VM share the dma-resv object. Hence during each execbuff
> > +submission, they need only one dma-resv fence list updated. Thus the fast
> > +path (where required mappings are already bound) submission latency is O(1)
> > +w.r.t the number of VM private BOs.
>
> Two things:
>
> - I think the above is required for initial vm_bind for vk, it kinda
>   doesn't make much sense without that, and will allow us to match amdgpu
>   and radeonsi
>
> - Christian König just landed ttm bulk lru helpers, and I think we need to
>   use those. This means vm_bind will only work with the ttm backend, but
>   that's what we have for the big dgpu where vm_bind helps more in terms
>   of performance, and the igfx conversion to ttm is already going on.
>
> Furthermore the i915 shrinker lru has stopped being an lru, so I think
> that should also be moved over to the ttm lru in some fashion to make sure
> we once again have a reasonable and consistent memory aging and reclaim
> architecture. The current code is just too much of a complete mess.
>
> And since this is all fairly integral to how the code arch works I don't
> think merging a different version which isn't based on ttm bulk lru
> helpers makes sense.
>
> Also I do think the page table lru handling needs to be included here,
> because that's another complete hand-rolled separate world for not much
> good reason. I guess that can happen in parallel with the initial vm_bind
> bring-up, but it needs to be completed by the time we add the features
> beyond the initial support needed for vk.
>
> > +VM_BIND locking hierarchy
> > +-------------------------
> > +VM_BIND locking order is as below.
> > +
> > +1) A vm_bind mutex will protect vm_bind lists. This lock is taken in vm_bind/
> > +   vm_unbind ioctl calls, in the execbuff path and while releasing the mapping.
> > +
> > +   In future, when GPU page faults are supported, we can potentially use a
> > +   rwsem instead, so that multiple pagefault handlers can take the read side
> > +   lock to lookup the mapping and hence can run in parallel.
> > +
> > +2) The BO's dma-resv lock will protect i915_vma state and needs to be held
> > +   while binding a vma and while updating the dma-resv fence list of a BO.
> > +   The private BOs of a VM will all share a dma-resv object.
> > +
> > +   This lock is held in the vm_bind call for immediate binding, during the
> > +   vm_unbind call for unbinding and during the execbuff path for binding
> > +   the mapping and updating the dma-resv fence list of the BO.
> > +
> > +3) Spinlock/s to protect some of the VM's lists.
> > +
> > +We will also need support for bulk LRU movement of persistent mappings to
> > +avoid additional latencies in the execbuff path.
>
> This needs more detail and explanation of how each level is required. Also
> the shared dma_resv for VM_PRIVATE objects is kinda important to explain.
>
> Like "some of the VM's lists" explains pretty much nothing.
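>
> To illustrate the bulk LRU angle, roughly what I'd expect this to look
> like on the kernel side. Sketch only: it assumes the just-reworked TTM
> bulk LRU helpers keep roughly their current interface, the function names
> are made up, and vm->bulk_move is a member i915 would have to grow, it
> doesn't exist today:
>
> 	/* one bulk LRU section per VM, set up at VM create time */
> 	static void i915_vm_init_bulk_lru(struct i915_address_space *vm)
> 	{
> 		ttm_lru_bulk_move_init(&vm->bulk_move);
> 	}
>
> 	/*
> 	 * Called at BO creation for I915_GEM_CREATE_EXT_VM_PRIVATE objects,
> 	 * with the (shared, per-VM) dma-resv lock held: every private BO of
> 	 * the VM lands in the same bulk move section, so the execbuf fast
> 	 * path can bump the whole working set in the LRU with one bulk
> 	 * operation instead of walking every BO.
> 	 */
> 	static void i915_vm_private_bo_init(struct i915_address_space *vm,
> 					    struct ttm_buffer_object *bo)
> 	{
> 		ttm_bo_set_bulk_move(bo, &vm->bulk_move);
> 	}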
>
> > +
> > +GPU page faults
> > +----------------
> > +Both older execbuff mode and the newer VM_BIND mode of binding will require
> > +using dma-fence to ensure residency.
> > +In future when GPU page faults are supported, no dma-fence usage is required
> > +as residency is purely managed by installing and removing/invalidating ptes.
> This is a bit confusing. I think one part of this should be moved into the
> section with future vm_bind use-cases (we're not going to support page
> faults with legacy softpin or even worse, relocations). The locking
> discussion should be part of the much longer list of use cases that
> motivate the locking design.
>
> > +
> > +
> > +User/Memory Fence
> > +==================
> > +The idea is to take a user specified virtual address and install an interrupt
> > +handler to wake up the current task when the memory location passes the user
> > +supplied filter.
> > +
> > +User/Memory fence is an <address, value> pair. To signal the user fence,
> > +the specified value will be written at the specified virtual address and
> > +the waiting process woken up. User can wait on a user fence with the
> > +gem_wait_user_fence ioctl.
> > +
> > +It also allows the user to emit their own MI_FLUSH/PIPE_CONTROL notify
> > +interrupt within their batches after updating the value to have sub-batch
> > +precision on the wakeup. Each batch can signal a user fence to indicate
> > +the completion of the next level batch. The completion of the very first
> > +level batch needs to be signaled by the command streamer. The user must
> > +provide the user/memory fence for this via the
> > +DRM_I915_GEM_EXECBUFFER_EXT_USER_FENCE extension of the execbuff ioctl,
> > +so that KMD can set up the command streamer to signal it.
> > +
> > +User/Memory fence can also be supplied to the kernel driver to signal/wake up
> > +the user process after completion of an asynchronous operation.
> > +
> > +When the VM_BIND ioctl is provided with a user/memory fence via the
> > +I915_VM_BIND_EXT_USER_FENCE extension, it will be signaled upon the completion
> > +of binding of that mapping. All async binds/unbinds are serialized, hence
> > +signaling of the user/memory fence also indicates the completion of all
> > +previous binds/unbinds.
> > +
> > +This feature will be derived from the below original work:
> > +https://patchwork.freedesktop.org/patch/349417/
>
> This is 1:1 tied to long running compute mode contexts (which in the uapi
> doc must reference the endless amounts of bikeshed summary we have in the
> docs about indefinite fences).
>
> I'd put this into a new section about compute and userspace memory fences
> support, with this and the next chapter ...
> > +
> > +
> > +VM_BIND use cases
> > +==================
> ... and then make this section here focus entirely on additional vm_bind
> use-cases that we'll be adding later on. Which doesn't need to go into any
> details, it's just justification for why we want to build the world on top
> of vm_bind.
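>
> On the user/memory fence itself, my reading of the <address, value>
> semantics, as a sketch only (the struct layout, names and the comparison
> op below are all made up, the real thing needs to be nailed down in the
> uapi kerneldoc):
>
> 	#include <linux/types.h>
> 	#include <stdbool.h>
> 	#include <stdint.h>
>
> 	/* hypothetical uapi layout of a user/memory fence */
> 	struct drm_i915_user_fence {
> 		__u64 addr;	/* qword-aligned address in the process address space */
> 		__u64 value;	/* value written there to signal the fence */
> 	};
>
> 	/*
> 	 * Signalling is just "write value to addr, then wake up waiters",
> 	 * done either by the command streamer or by the kernel (e.g. on
> 	 * bind completion). The wait ioctl is then essentially this check
> 	 * plus sleeping on the interrupt instead of busy-looping:
> 	 */
> 	static bool user_fence_signaled(const struct drm_i915_user_fence *fence)
> 	{
> 		return *(volatile const __u64 *)(uintptr_t)fence->addr >= fence->value;
> 	}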
>
> > +
> > +Long running Compute contexts
> > +------------------------------
> > +Usage of dma-fence expects that they complete in a reasonable amount of time.
> > +Compute on the other hand can be long running. Hence it is appropriate for
> > +compute to use user/memory fence and dma-fence usage will be limited to
> > +in-kernel consumption only. This requires an execbuff uapi extension to pass
> > +in a user fence. Compute must opt-in for this mechanism with the
> > +I915_CONTEXT_CREATE_FLAGS_LONG_RUNNING flag during context creation.
> > +
> > +The dma-fence based user interfaces like the gem_wait ioctl, execbuff out
> > +fence and implicit dependency setting are not allowed on long running
> > +contexts.
> > +
> > +Where GPU page faults are not available, the kernel driver, upon buffer
> > +invalidation, will initiate a suspend (preemption) of the long running
> > +context with a dma-fence attached to it. Upon completion of that suspend
> > +fence, it finishes the invalidation, revalidates the BO and then resumes
> > +the compute context. This is done by having a per-context fence (called
> > +suspend fence) proxying as an i915_request fence. This suspend fence is
> > +enabled when there is a wait on it, which triggers the context preemption.
> > +
> > +This is much easier to support with VM_BIND compared to the current heavier
> > +execbuff path resource attachment.
>
> There's a bunch of tricky code around compute mode context support, like
> the preempt ctx fence (or suspend fence or whatever you want to call it),
> and the resume work. And I think that code should be shared across
> drivers.
>
> I think the right place to put this is into drm/sched, somewhere attached
> to the drm_sched_entity structure. I expect i915 folks to collaborate with
> amd and ideally also get amdkfd to adopt the same thing if possible. At
> least Christian has mentioned in the past that he's a bit unhappy about
> how this works.
>
> Also drm/sched has dependency tracking, which will be needed to pipeline
> context resume operations. That needs to be used instead of i915-gem
> inventing yet another dependency tracking data structure (it already has 3
> and that's roughly 3 too many).
>
> This means compute mode support and userspace memory fences are blocked on
> the drm/sched conversion, but *eh* add it to the list of reasons for why
> drm/sched needs to happen.
>
> Also since we only have support for compute mode ctx in our internal tree
> with the guc scheduler backend anyway, and the first conversion target is
> the guc backend, I don't think this actually holds up a lot of the code.
>
> > +Low Latency Submission
> > +-----------------------
> > +Allows compute UMD to directly submit GPU jobs instead of going through the
> > +execbuff ioctl. VM_BIND allows map/unmap of BOs required for directly
> > +submitted jobs.
>
> This is really just a special case of compute mode contexts, I think I'd
> include that in there, but explain better what it requires (i.e. vm_bind
> not being synchronized against execbuf).
>
> > +
> > +Debugger
> > +---------
> > +With the debug event interface, a user space process (debugger) is able to
> > +keep track of and act upon resources created by another process (debuggee)
> > +and attached to the GPU via the vm_bind interface.
> > +
> > +Mesa/Vulkan
> > +------------
> > +VM_BIND can potentially reduce the CPU-overhead in Mesa thus improving
> > +performance. For Vulkan it should be straightforward to use VM_BIND.
> > +For Iris implicit buffer tracking must be implemented before we can harness
> > +VM_BIND benefits. With increasing GPU hardware performance reducing CPU
> > +overhead becomes more important.
>
> Just to clarify, I don't think we can land vm_bind into upstream if it
> doesn't work 100% for vk. There's a bit much "can" instead of "will" in
> this section.
>
> > +
> > +Page level hints settings
> > +--------------------------
> > +VM_BIND allows any hints setting per mapping instead of per BO.
> > +Possible hints include read-only, placement and atomicity.
> > +Sub-BO level placement hint will be even more relevant with
> > +upcoming GPU on-demand page fault support.
> > +
> > +Page level Cache/CLOS settings
> > +-------------------------------
> > +VM_BIND allows cache/CLOS settings per mapping instead of per BO.
> > +
> > +Shared Virtual Memory (SVM) support
> > +------------------------------------
> > +VM_BIND interface can be used to map system memory directly (without gem BO
> > +abstraction) using the HMM interface.
>
> Userptr is absent here (and it's not the same as svm, at least on
> discrete), and this is needed for the initial version since otherwise vk
> can't use it because we're not at feature parity.
>
> Irc discussions by Maarten and Dave came up with the idea that maybe
> userptr for vm_bind should work _without_ any gem bo as backing storage,
> since that guarantees that people don't come up with funny ideas like
> trying to share such a bo across processes or mmap it and other nonsense
> which just doesn't work.
>
> > +
> > +
> > +Broader i915 cleanups
> > +=====================
> > +Supporting this whole new vm_bind mode of binding, which comes with its own
> > +usecases to support and its locking requirements, requires proper integration
> > +with the existing i915 driver. This calls for some broader i915 driver
> > +cleanups/simplifications for maintainability of the driver going forward.
> > +Here are a few things identified that are being looked into.
> > +
> > +- Make pagetable allocations evictable and manage them similar to VM_BIND
> > +  mapped objects. Page table pages are similar to persistent mappings of a
> > +  VM (the difference here is that the page table pages will not
> > +  have an i915_vma structure and after swapping pages back in, the parent
> > +  page link needs to be updated).
>
> See above, but I think this should be included as part of the initial
> vm_bind push.
>
> > +- Remove the vma lookup cache (eb->gem_context->handles_vma). The VM_BIND
> > +  feature does not use it and the complexity it brings in is probably more
> > +  than the performance advantage we get in the legacy execbuff case.
> > +- Remove vma->open_count counting
> > +- Remove i915_vma active reference tracking. Instead use the underlying BO's
> > +  dma-resv fence list to determine if an i915_vma is active or not.
>
> So this is a complete mess, and really should not exist. I think it needs
> to be removed before we try to make i915_vma even more complex by adding
> vm_bind.
>
> The other thing I've been pondering here is that vm_bind is really
> completely different from legacy vm structures for a lot of reasons:
> - no relocation or softpin handling, which means vm_bind has no reason to
>   ever look at the i915_vma structure in execbuf code. Unfortunately
>   execbuf has been rewritten to be vma instead of obj centric, so it's a
>   100% mismatch
>
> - vm_bind never has to manage any vm lru.
>   Legacy execbuf has to maintain
>   that because the kernel manages the virtual address space fully. Again
>   ideally that entire vma_move_to_active code and everything related to it
>   would simply not exist.
>
> - similar on the eviction side, the rules are quite different: For vm_bind
>   we never tear down the vma, instead it's just moved to the list of
>   evicted vma. Legacy vms have no need for all these additional lists, so
>   another huge confusion.
>
> - if the refcount is done correctly for vm_bind we wouldn't need the
>   tricky code in the bo close paths. Unfortunately legacy vms with
>   relocations and softpin require that vma are only a weak reference, so
>   that cannot be removed.
>
> - there's also a ton of special cases for ggtt handling, like the
>   different views (for display, partial views for mmap), but also the
>   gen2/3 alignment and padding requirements which vm_bind never needs.
>
> I think the right thing here is to massively split the implementation
> behind some solid vm/vma abstraction, with a base class for vm and vma
> which _only_ has the pieces which both vm_bind and the legacy vm stuff
> needs. But it's a bit tricky to get there. I think a workable path would
> be:
> - Add a new base class to both i915_address_space and i915_vma, which
>   starts out empty.
>
> - As vm_bind code lands, move things that vm_bind code needs into these
>   base classes
>
> - The goal should be that these base classes are a stand-alone library
>   that other drivers could reuse. Like we've done with the buddy
>   allocator, which first moved from i915-gem to i915-ttm, and which amd
>   now moved to drm/ttm for reuse by amdgpu. Ideally other drivers
>   interested in adding something like vm_bind should be involved from the
>   start (or maybe the entire thing reused in amdgpu, they're looking at
>   vk sparse binding support too or at least have perf issues I think).
>
> - Locking must be the same across all implementations, otherwise it's
>   really not an abstraction. i915 screwed this up terribly by having
>   different locking rules for ppgtt and ggtt, which is just nonsense.
>
> - The legacy specific code needs to be extracted as much as possible and
>   shoved into separate files. In execbuf this means we need to get back to
>   object centric flow, and the slowpaths need to become a lot simpler
>   again (Maarten has cleaned up some of this, but there's still a silly
>   amount of hacks in there with funny layering).
>
> - I think if stuff like the vma eviction details (list movement and
>   locking and refcounting of the underlying object)
>
> > +
> > +These can be worked upon after initial vm_bind support is added.
>
> I don't think that works, given how badly the i915-gem team screwed up in
> other places. And those places had to be fixed by adopting shared code
> like ttm. Plus there's already a huge unfulfilled promise pending with the
> drm/sched conversion, i915-gem team is clearly deeply in the red here :-/
>
> Cheers, Daniel
>
> > +
> > +
> > +UAPI
> > +=====
> > +Uapi definition can be found here:
> > +.. kernel-doc:: Documentation/gpu/rfc/i915_vm_bind.h
> > diff --git a/Documentation/gpu/rfc/index.rst b/Documentation/gpu/rfc/index.rst
> > index 91e93a705230..7d10c36b268d 100644
> > --- a/Documentation/gpu/rfc/index.rst
> > +++ b/Documentation/gpu/rfc/index.rst
> > @@ -23,3 +23,7 @@ host such documentation:
> >  .. toctree::
> >
> >     i915_scheduler.rst
> > +
> > +.. toctree::
> > +
> > +   i915_vm_bind.rst
> > --
> > 2.21.0.rc0.32.g243a4c7e27
> >
>
> --
> Daniel Vetter
> Software Engineer, Intel Corporation
> http://blog.ffwll.ch

--
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch