Re: [PATCH 7/7] drm/i915/gem: Acquire all vma/objects under reservation_ww_class

Thomas Hellström (Intel) <thomas_os@xxxxxxxxxxxx> · Tue, 23 Jun 2020 17:09:08 +0200

On 6/23/20 4:01 PM, Chris Wilson wrote:
Quoting Thomas Hellström (Intel) (2020-06-23 13:57:06)
On 6/23/20 1:22 PM, Thomas Hellström (Intel) wrote:
Hi, Chris,

On 6/22/20 11:59 AM, Chris Wilson wrote:
In order to actually handle eviction and what not, we need to process
all the objects together under a common lock, reservation_ww_class. As
such, do a memory reservation pass after looking up the object/vma,
which then feeds into the rest of execbuf [relocation, cmdparsing,
flushing and ofc execution].

Signed-off-by: Chris Wilson <chris@xxxxxxxxxxxxxxxxxx>
---
   .../gpu/drm/i915/gem/i915_gem_execbuffer.c    | 91 ++++++++++++++-----
   1 file changed, 70 insertions(+), 21 deletions(-)

diff --git a/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c
b/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c
index 46fcbdf8161c..8db2e013465f 100644
--- a/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c
+++ b/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c
@@ -53,10 +53,9 @@ struct eb_vma_array {
     #define __EXEC_OBJECT_HAS_PIN        BIT(31)
   #define __EXEC_OBJECT_HAS_FENCE        BIT(30)
-#define __EXEC_OBJECT_HAS_PAGES        BIT(29)
-#define __EXEC_OBJECT_NEEDS_MAP        BIT(28)
-#define __EXEC_OBJECT_NEEDS_BIAS    BIT(27)
-#define __EXEC_OBJECT_INTERNAL_FLAGS    (~0u << 27) /* all of the
above */
+#define __EXEC_OBJECT_NEEDS_MAP        BIT(29)
+#define __EXEC_OBJECT_NEEDS_BIAS    BIT(28)
+#define __EXEC_OBJECT_INTERNAL_FLAGS    (~0u << 28) /* all of the
above */
     #define __EXEC_HAS_RELOC    BIT(31)
   #define __EXEC_INTERNAL_FLAGS    (~0u << 31)
@@ -241,6 +240,8 @@ struct i915_execbuffer {
       struct intel_context *context; /* logical state for the request */
       struct i915_gem_context *gem_context; /** caller's context */
   +    struct dma_fence *mm_fence;
+
       struct i915_request *request; /** our request to build */
       struct eb_vma *batch; /** identity of the batch obj/vma */
       struct i915_vma *trampoline; /** trampoline used for chaining */
@@ -331,12 +332,7 @@ static inline void eb_unreserve_vma(struct
eb_vma *ev)
       if (ev->flags & __EXEC_OBJECT_HAS_PIN)
           __i915_vma_unpin(vma);
   -    if (ev->flags & __EXEC_OBJECT_HAS_PAGES)
-        i915_gem_object_unpin_pages(vma->obj);
-
-    ev->flags &= ~(__EXEC_OBJECT_HAS_PIN |
-               __EXEC_OBJECT_HAS_FENCE |
-               __EXEC_OBJECT_HAS_PAGES);
+    ev->flags &= ~(__EXEC_OBJECT_HAS_PIN | __EXEC_OBJECT_HAS_FENCE);
   }
     static void eb_vma_array_destroy(struct kref *kref)
@@ -667,6 +663,55 @@ eb_add_vma(struct i915_execbuffer *eb,
       list_add_tail(&ev->lock_link, &eb->lock);
   }
   +static int eb_vma_get_pages(struct i915_execbuffer *eb,
+                struct eb_vma *ev,
+                u64 idx)
+{
+    struct i915_vma *vma = ev->vma;
+    int err;
+
+    /* XXX also preallocate PD for vma */
+
+    err = ____i915_gem_object_get_pages_async(vma->obj);
+    if (err)
+        return err;
+
+    return i915_active_ref(&vma->obj->mm.active, idx, eb->mm_fence);
+}
+
+static int eb_reserve_mm(struct i915_execbuffer *eb)
+{
+    const u64 idx = eb->context->timeline->fence_context;
+    struct ww_acquire_ctx acquire;
+    struct eb_vma *ev;
+    int err;
+
+    eb->mm_fence = __dma_fence_create_proxy(0, 0);
+    if (!eb->mm_fence)
+        return -ENOMEM;
Question: eb is local to this thread, right, so eb->mm_fence is not
considered "published" yet?

+
+    ww_acquire_init(&acquire, &reservation_ww_class);
+
+    err = eb_lock_vma(eb, &acquire);
+    if (err)
+        goto out;
+
+    ww_acquire_done(&acquire);
+
+    list_for_each_entry(ev, &eb->lock, lock_link) {
+        struct i915_vma *vma = ev->vma;
+
+        if (err == 0)
+            err = eb_vma_get_pages(eb, ev, idx);
I figure this is where you publish the proxy fence? If so, the fence
signaling critical path starts with this loop, and that means any code
we call between here and submission complete (including spawned work
we need to wait for before submission) may not lock the
reservation_ww_class nor (still being discussed) allocate memory.
Yes, at this point we have reserved the memory for the execbuf.

It
looks like i915_pin_vma takes a reservation_ww_class. And all memory
pinning seems to be in the fence critical path as well?
Correct, it's not meant to be waiting inside i915_vma_pin(); the
intention was to pass in memory, and then we would not need to
do the acquire ourselves. As we have just reserved the memory in the
above loop, this should not be an issue. I was trying to keep the
change minimal and allow incremental conversions. It does however need
to add a reference to the object for the work it spawns -- equally
though there is an async eviction pass later in execbuf. The challenge
here is that the greedy grab of bound vma is faster than doing the
unbound eviction handling (even when eviction is not required).

So for the i915_vma_pin, it looks like

fence_critical_start(eb_reserve_mm) -> 
dma_resv_lock_interruptible(i915_vma_pin) -> lockdep issue.

You can't take the dma_resv_lock inside a fence critical section.

And for the memory allocation, it looks like the fence is published in 
the first loop iteration, starting the critical section, meaning that 
any memory allocation that follows will cause a lockdep issue. That 
includes worker threads. (with the proposed dma_fence annotations).


And I think even if we at some point end up with the allocation
annotation the other way around, allowing memory allocations in fence
signalling critical paths, both relocations and userpointer would cause
lockdep problems because of

mmap_sem->reservation_object->fence_wait (fault handlers, lockdep priming)
We don't wait inside mmap_sem. One cannot, you do not know the locking
context, so you can only try to reclaim idle space. So you end up with
the issue of a multitude of threads each trying to claim the last slice
of the aperture/backing storage, not being able to directly reclaim and
so have to hit the equivalent of kswapd.

I don't think I follow you here. There are a number of drivers that wait 
for dma_fences inside the fault handlers with mmap_sem held for data to 
be migrated before the pte is set up.


vs
fence_critical->gup/copy_from_user->mmap_sem
Which exists today, even the busy wait loop is implicit linkage; you only
need userspace to be holding a resource on the gpu to create the deadlock.
I've been using the userfault handler to develop test cases where we can
arbitrarily block the userptr.

Yes but in a case where we don't publish the fence early, the above 
would be reduced to the well known reservation_ww_class vs mmap_sem 
lockdep issue, which other drivers seem to have solved and we could copy 
what they've done.

/Thomas




-Chris
_______________________________________________
Intel-gfx mailing list
Intel-gfx@xxxxxxxxxxxxxxxxxxxxx
https://lists.freedesktop.org/mailman/listinfo/intel-gfx