Re: [Intel-gfx] [PATCH v2 1/3] drm/doc/rfc: VM_BIND feature design document

Niranjana Vishwanathapura <niranjana.vishwanathapura@xxxxxxxxx> · Tue, 21 Jun 2022 07:43:29 -0700

On Tue, Jun 21, 2022 at 09:35:16AM +0100, Tvrtko Ursulin wrote:

On 20/06/2022 17:29, Niranjana Vishwanathapura wrote:
On Mon, Jun 20, 2022 at 11:43:10AM +0100, Tvrtko Ursulin wrote:

Hi,

On 17/06/2022 06:14, Niranjana Vishwanathapura wrote:
VM_BIND design document with description of intended use cases.

v2: Reduce the scope to simple Mesa use case.

since I expressed interest please add me to cc when sending out.


Hi Tvrtko,
I did include you in the cc list with git send-email, but looks like 
some patches
in this series has the full cc list, but some don't (you are on cc 
list of this
patch though). I am not sure why.

Odd, I'm not on CC on the (only for me) copy I found in the mailing list.

How come the direction changed to simplify all of a sudden? I did 
not spot any discussion to that effect. Was it internal talks?


Yah, some of us had offline discussion involving the Mesa team.
I did update the thread (previous version of this patch series) about that.
Plan was to align our roadmap to focus on the deliverables at this point
without further complicating the uapi.

Signed-off-by: Niranjana Vishwanathapura 
<niranjana.vishwanathapura@xxxxxxxxx>
---
 Documentation/gpu/rfc/i915_vm_bind.rst | 238 +++++++++++++++++++++++++
 Documentation/gpu/rfc/index.rst        |   4 +
 2 files changed, 242 insertions(+)
 create mode 100644 Documentation/gpu/rfc/i915_vm_bind.rst

diff --git a/Documentation/gpu/rfc/i915_vm_bind.rst 
b/Documentation/gpu/rfc/i915_vm_bind.rst
new file mode 100644
index 000000000000..4ab590ef11fd
--- /dev/null
+++ b/Documentation/gpu/rfc/i915_vm_bind.rst
@@ -0,0 +1,238 @@
+==========================================
+I915 VM_BIND feature design and use cases
+==========================================
+
+VM_BIND feature
+================
+DRM_I915_GEM_VM_BIND/UNBIND ioctls allows UMD to bind/unbind GEM buffer
+objects (BOs) or sections of a BOs at specified GPU virtual 
addresses on a
+specified address space (VM). These mappings (also referred to 
as persistent
+mappings) will be persistent across multiple GPU submissions 
(execbuf calls)
+issued by the UMD, without user having to provide a list of all 
required
+mappings during each submission (as required by older execbuf mode).
+
+The VM_BIND/UNBIND calls allow UMDs to request a timeline fence 
for signaling
+the completion of bind/unbind operation.
+
+VM_BIND feature is advertised to user via I915_PARAM_HAS_VM_BIND.
+User has to opt-in for VM_BIND mode of binding for an address 
space (VM)
+during VM creation time via I915_VM_CREATE_FLAGS_USE_VM_BIND extension.
+
+Normally, vm_bind/unbind operations will get completed synchronously,

To me synchronously, at this point in the text, reads as ioctl 
will return only when the operation is done. Rest of the paragraph 
however disagrees (plus existence of out fence). It is not clear 
to me what is the actual behaviour. Will it be clear to userspace 
developers reading uapi kerneldoc? If it is async, what are the 
ordering rules in this version?


Yah, here I am simply stating the i915_vma_pin_ww() behavior which mostly
does the binding synchronously unless there is a moving fence associated
with the object in which case, binding will complete later once that fence
is signaled (hence the out fence).

So from userspace point of view it is fully asynchronous and out of 
order? I'd suggest spelling that out in the uapi kerneldoc.


Yah. I can see how some i915 details I provided here can be confusing.
Ok, will remove and spell it out that user must anticipate fully async
out of order completions.

+but if the object is being moved, the binding will happen once that the
+moving is complete and out fence will be signaled after binding 
is complete.
+The bind/unbind operation can get completed out of submission order.
+
+VM_BIND features include:
+
+* Multiple Virtual Address (VA) mappings can map to the same 
physical pages
+  of an object (aliasing).
+* VA mapping can map to a partial section of the BO (partial binding).
+* Support capture of persistent mappings in the dump upon GPU error.
+* TLB is flushed upon unbind completion. Batching of TLB 
flushes in some
+  use cases will be helpful.
+* Support for userptr gem objects (no special uapi is required 
for this).
+
+Execbuf ioctl in VM_BIND mode
+-------------------------------
+A VM in VM_BIND mode will not support older execbuf mode of binding.
+The execbuf ioctl handling in VM_BIND mode differs 
significantly from the
+older execbuf2 ioctl (See struct drm_i915_gem_execbuffer2).
+Hence, a new execbuf3 ioctl has been added to support VM_BIND 
mode. (See
+struct drm_i915_gem_execbuffer3). The execbuf3 ioctl will not 
accept any
+execlist. Hence, no support for implicit sync. It is expected 
that the below
+work will be able to support requirements of object dependency 
setting in all
+use cases:
+
+"dma-buf: Add an API for exporting sync files"
+(https://lwn.net/Articles/859290/)

What does this mean? If execbuf3 does not know about target 
objects how can we add a meaningful fence?


Execbuf3 does know about the target objects. It is all the objects
bound to that VM via vm_bind call.

+
+The execbuf3 ioctl directly specifies the batch addresses instead of as
+object handles as in execbuf2 ioctl. The execbuf3 ioctl will also not
+support many of the older features like in/out/submit fences, 
fence array,
+default gem context and many more (See struct 
drm_i915_gem_execbuffer3).
+
+In VM_BIND mode, VA allocation is completely managed by the 
user instead of
+the i915 driver. Hence all VA assignment, eviction are not 
applicable in
+VM_BIND mode. Also, for determining object activeness, VM_BIND 
mode will not
+be using the i915_vma active reference tracking. It will 
instead use dma-resv
+object for that (See `VM_BIND dma_resv usage`_).
+
+So, a lot of existing code supporting execbuf2 ioctl, like 
relocations, VA
+evictions, vma lookup table, implicit sync, vma active 
reference tracking etc.,
+are not applicable for execbuf3 ioctl. Hence, all execbuf3 
specific handling
+should be in a separate file and only functionalities common to 
these ioctls
+can be the shared code where possible.
+
+VM_PRIVATE objects
+-------------------
+By default, BOs can be mapped on multiple VMs and can also be dma-buf
+exported. Hence these BOs are referred to as Shared BOs.
+During each execbuf submission, the request fence must be added to the
+dma-resv fence list of all shared BOs mapped on the VM.

Does this tie to my previous question? Design is to add each fence 
to literally _all_ BOs mapped to a VM, on every execbuf3? If so, 
is that definitely needed and for what use case? Mixing implicit 
and explicit, I mean bridging implicit and explicit sync clients?


Yes. It is similar to how legacy execbuf2 does. ie., add request fence
to all of the target BOs. Only difference is in execbuf2 case, target
objects are the objects in execlist, whereas in execbuf2, it is all
the BOs mapped to that VM via vm_bind call. It is needed as UMD says
that it is needed by vm_bind'ing the BO before the execbuf3 call.

Sorry I did not understand why it is needed, the last sentence that 
is, what did that suppose to mean?


I am seeing there is a typo in my above comment. It should have been,
"wherewas in execbuf3, it is all the BOs mapped to that VM via vm_bind call".

We need all the BO's dma-resv fence list should be properly updated
as we depend on it for gem_wait ioctl etc. Also note that we are moving
away from i915_vma active tracking mechanism and instead will be checking
the BO's dma-resv fence list to check if BO is active or not.
So, we need the BO's dma-resv fence list properly updated.
As for execbuf3, all the vm_bind BOs are target BOs, we need to update
the dma-resv fence list for all of them (private or shared).

Niranjana

Regards,

Tvrtko

Niranjana

Regards,

Tvrtko

+
+VM_BIND feature introduces an optimization where user can 
create BO which
+is private to a specified VM via I915_GEM_CREATE_EXT_VM_PRIVATE 
flag during
+BO creation. Unlike Shared BOs, these VM private BOs can only 
be mapped on
+the VM they are private to and can't be dma-buf exported.
+All private BOs of a VM share the dma-resv object. Hence during 
each execbuf
+submission, they need only one dma-resv fence list updated. 
Thus, the fast
+path (where required mappings are already bound) submission 
latency is O(1)
+w.r.t the number of VM private BOs.
+
+VM_BIND locking hirarchy
+-------------------------
+The locking design here supports the older (execlist based) 
execbuf mode, the
+newer VM_BIND mode, the VM_BIND mode with GPU page faults and 
possible future
+system allocator support (See `Shared Virtual Memory (SVM) support`_).
+The older execbuf mode and the newer VM_BIND mode without page 
faults manages
+residency of backing storage using dma_fence. The VM_BIND mode 
with page faults
+and the system allocator support do not use any dma_fence at all.
+
+VM_BIND locking order is as below.
+
+1) Lock-A: A vm_bind mutex will protect vm_bind lists. This 
lock is taken in
+   vm_bind/vm_unbind ioctl calls, in the execbuf path and while 
releasing the
+   mapping.
+
+   In future, when GPU page faults are supported, we can 
potentially use a
+   rwsem instead, so that multiple page fault handlers can take 
the read side
+   lock to lookup the mapping and hence can run in parallel.
+   The older execbuf mode of binding do not need this lock.
+
+2) Lock-B: The object's dma-resv lock will protect i915_vma 
state and needs to
+   be held while binding/unbinding a vma in the async worker 
and while updating
+   dma-resv fence list of an object. Note that private BOs of a 
VM will all
+   share a dma-resv object.
+
+   The future system allocator support will use the HMM 
prescribed locking
+   instead.
+
+3) Lock-C: Spinlock/s to protect some of the VM's lists like 
the list of
+   invalidated vmas (due to eviction and userptr invalidation) etc.
+
+When GPU page faults are supported, the execbuf path do not 
take any of these
+locks. There we will simply smash the new batch buffer address 
into the ring and
+then tell the scheduler run that. The lock taking only happens 
from the page
+fault handler, where we take lock-A in read mode, whichever 
lock-B we need to
+find the backing storage (dma_resv lock for gem objects, and 
hmm/core mm for
+system allocator) and some additional locks (lock-D) for taking 
care of page
+table races. Page fault mode should not need to ever manipulate 
the vm lists,
+so won't ever need lock-C.
+
+VM_BIND LRU handling
+---------------------
+We need to ensure VM_BIND mapped objects are properly LRU 
tagged to avoid
+performance degradation. We will also need support for bulk LRU 
movement of
+VM_BIND objects to avoid additional latencies in execbuf path.
+
+The page table pages are similar to VM_BIND mapped objects (See
+`Evictable page table allocations`_) and are maintained per VM 
and needs to
+be pinned in memory when VM is made active (ie., upon an 
execbuf call with
+that VM). So, bulk LRU movement of page table pages is also needed.
+
+VM_BIND dma_resv usage
+-----------------------
+Fences needs to be added to all VM_BIND mapped objects. During 
each execbuf
+submission, they are added with DMA_RESV_USAGE_BOOKKEEP usage 
to prevent
+over sync (See enum dma_resv_usage). One can override it with either
+DMA_RESV_USAGE_READ or DMA_RESV_USAGE_WRITE usage during object 
dependency
+setting (either through explicit or implicit mechanism).
+
+When vm_bind is called for a non-private object while the VM is already
+active, the fences need to be copied from VM's shared dma-resv object
+(common to all private objects of the VM) to this non-private object.
+If this results in performance degradation, then some optimization will
+be needed here. This is not a problem for VM's private objects 
as they use
+shared dma-resv object which is always updated on each execbuf 
submission.
+
+Also, in VM_BIND mode, use dma-resv apis for determining object 
activeness
+(See dma_resv_test_signaled() and dma_resv_wait_timeout()) and 
do not use the
+older i915_vma active reference tracking which is deprecated. 
This should be
+easier to get it working with the current TTM backend.
+
+Mesa use case
+--------------
+VM_BIND can potentially reduce the CPU overhead in Mesa (both 
Vulkan and Iris),
+hence improving performance of CPU-bound applications. It also 
allows us to
+implement Vulkan's Sparse Resources. With increasing GPU 
hardware performance,
+reducing CPU overhead becomes more impactful.
+
+
+Other VM_BIND use cases
+========================
+
+Long running Compute contexts
+------------------------------
+Usage of dma-fence expects that they complete in reasonable 
amount of time.
+Compute on the other hand can be long running. Hence it is 
appropriate for
+compute to use user/memory fence (See `User/Memory Fence`_) and 
dma-fence usage
+must be limited to in-kernel consumption only.
+
+Where GPU page faults are not available, kernel driver upon 
buffer invalidation
+will initiate a suspend (preemption) of long running context, 
finish the
+invalidation, revalidate the BO and then resume the compute 
context. This is
+done by having a per-context preempt fence which is enabled 
when someone tries
+to wait on it and triggers the context preemption.
+
+User/Memory Fence
+~~~~~~~~~~~~~~~~~~
+User/Memory fence is a <address, value> pair. To signal the 
user fence, the
+specified value will be written at the specified virtual 
address and wakeup the
+waiting process. User fence can be signaled either by the GPU 
or kernel async
+worker (like upon bind completion). User can wait on a user 
fence with a new
+user fence wait ioctl.
+
+Here is some prior work on this:
+https://patchwork.freedesktop.org/patch/349417/
+
+Low Latency Submission
+~~~~~~~~~~~~~~~~~~~~~~~
+Allows compute UMD to directly submit GPU jobs instead of 
through execbuf
+ioctl. This is made possible by VM_BIND is not being 
synchronized against
+execbuf. VM_BIND allows bind/unbind of mappings required for 
the directly
+submitted jobs.
+
+Debugger
+---------
+With debug event interface user space process (debugger) is 
able to keep track
+of and act upon resources created by another process (debugged) 
and attached
+to GPU via vm_bind interface.
+
+GPU page faults
+----------------
+GPU page faults when supported (in future), will only be 
supported in the
+VM_BIND mode. While both the older execbuf mode and the newer 
VM_BIND mode of
+binding will require using dma-fence to ensure residency, the 
GPU page faults
+mode when supported, will not use any dma-fence as residency is 
purely managed
+by installing and removing/invalidating page table entries.
+
+Page level hints settings
+--------------------------
+VM_BIND allows any hints setting per mapping instead of per BO.
+Possible hints include read-only mapping, placement and atomicity.
+Sub-BO level placement hint will be even more relevant with
+upcoming GPU on-demand page fault support.
+
+Page level Cache/CLOS settings
+-------------------------------
+VM_BIND allows cache/CLOS settings per mapping instead of per BO.
+
+Evictable page table allocations
+---------------------------------
+Make pagetable allocations evictable and manage them similar to VM_BIND
+mapped objects. Page table pages are similar to persistent 
mappings of a
+VM (difference here are that the page table pages will not have 
an i915_vma
+structure and after swapping pages back in, parent page link 
needs to be
+updated).
+
+Shared Virtual Memory (SVM) support
+------------------------------------
+VM_BIND interface can be used to map system memory directly 
(without gem BO
+abstraction) using the HMM interface. SVM is only supported 
with GPU page
+faults enabled.
+
+VM_BIND UAPI
+=============
+
+.. kernel-doc:: Documentation/gpu/rfc/i915_vm_bind.h
diff --git a/Documentation/gpu/rfc/index.rst 
b/Documentation/gpu/rfc/index.rst
index 91e93a705230..7d10c36b268d 100644
--- a/Documentation/gpu/rfc/index.rst
+++ b/Documentation/gpu/rfc/index.rst
@@ -23,3 +23,7 @@ host such documentation:
 .. toctree::
     i915_scheduler.rst
+
+.. toctree::
+
+    i915_vm_bind.rst