Re: [PATCH 7/7] drm/i915/gem: Acquire all vma/objects under reservation_ww_class

Chris Wilson <chris@xxxxxxxxxxxxxxxxxx> · Tue, 23 Jun 2020 15:01:01 +0100

Quoting Thomas Hellström (Intel) (2020-06-23 13:57:06)
> 
> On 6/23/20 1:22 PM, Thomas Hellström (Intel) wrote:
> > Hi, Chris,
> >
> > On 6/22/20 11:59 AM, Chris Wilson wrote:
> >> In order to actually handle eviction and what not, we need to process
> >> all the objects together under a common lock, reservation_ww_class. As
> >> such, do a memory reservation pass after looking up the object/vma,
> >> which then feeds into the rest of execbuf [relocation, cmdparsing,
> >> flushing and ofc execution].
> >>
> >> Signed-off-by: Chris Wilson <chris@xxxxxxxxxxxxxxxxxx>
> >> ---
> >>   .../gpu/drm/i915/gem/i915_gem_execbuffer.c    | 91 ++++++++++++++-----
> >>   1 file changed, 70 insertions(+), 21 deletions(-)
> >>
> >> diff --git a/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c b/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c
> >> index 46fcbdf8161c..8db2e013465f 100644
> >> --- a/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c
> >> +++ b/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c
> >> @@ -53,10 +53,9 @@ struct eb_vma_array {
> >>     #define __EXEC_OBJECT_HAS_PIN        BIT(31)
> >>   #define __EXEC_OBJECT_HAS_FENCE        BIT(30)
> >> -#define __EXEC_OBJECT_HAS_PAGES        BIT(29)
> >> -#define __EXEC_OBJECT_NEEDS_MAP        BIT(28)
> >> -#define __EXEC_OBJECT_NEEDS_BIAS    BIT(27)
> >> -#define __EXEC_OBJECT_INTERNAL_FLAGS    (~0u << 27) /* all of the above */
> >> +#define __EXEC_OBJECT_NEEDS_MAP        BIT(29)
> >> +#define __EXEC_OBJECT_NEEDS_BIAS    BIT(28)
> >> +#define __EXEC_OBJECT_INTERNAL_FLAGS    (~0u << 28) /* all of the above */
> >>     #define __EXEC_HAS_RELOC    BIT(31)
> >>   #define __EXEC_INTERNAL_FLAGS    (~0u << 31)
> >> @@ -241,6 +240,8 @@ struct i915_execbuffer {
> >>       struct intel_context *context; /* logical state for the request */
> >>       struct i915_gem_context *gem_context; /** caller's context */
> >>   +    struct dma_fence *mm_fence;
> >> +
> >>       struct i915_request *request; /** our request to build */
> >>       struct eb_vma *batch; /** identity of the batch obj/vma */
> >>       struct i915_vma *trampoline; /** trampoline used for chaining */
> >> @@ -331,12 +332,7 @@ static inline void eb_unreserve_vma(struct eb_vma *ev)
> >>       if (ev->flags & __EXEC_OBJECT_HAS_PIN)
> >>           __i915_vma_unpin(vma);
> >>   -    if (ev->flags & __EXEC_OBJECT_HAS_PAGES)
> >> -        i915_gem_object_unpin_pages(vma->obj);
> >> -
> >> -    ev->flags &= ~(__EXEC_OBJECT_HAS_PIN |
> >> -               __EXEC_OBJECT_HAS_FENCE |
> >> -               __EXEC_OBJECT_HAS_PAGES);
> >> +    ev->flags &= ~(__EXEC_OBJECT_HAS_PIN | __EXEC_OBJECT_HAS_FENCE);
> >>   }
> >>     static void eb_vma_array_destroy(struct kref *kref)
> >> @@ -667,6 +663,55 @@ eb_add_vma(struct i915_execbuffer *eb,
> >>       list_add_tail(&ev->lock_link, &eb->lock);
> >>   }
> >>   +static int eb_vma_get_pages(struct i915_execbuffer *eb,
> >> +                struct eb_vma *ev,
> >> +                u64 idx)
> >> +{
> >> +    struct i915_vma *vma = ev->vma;
> >> +    int err;
> >> +
> >> +    /* XXX also preallocate PD for vma */
> >> +
> >> +    err = ____i915_gem_object_get_pages_async(vma->obj);
> >> +    if (err)
> >> +        return err;
> >> +
> >> +    return i915_active_ref(&vma->obj->mm.active, idx, eb->mm_fence);
> >> +}
> >> +
> >> +static int eb_reserve_mm(struct i915_execbuffer *eb)
> >> +{
> >> +    const u64 idx = eb->context->timeline->fence_context;
> >> +    struct ww_acquire_ctx acquire;
> >> +    struct eb_vma *ev;
> >> +    int err;
> >> +
> >> +    eb->mm_fence = __dma_fence_create_proxy(0, 0);
> >> +    if (!eb->mm_fence)
> >> +        return -ENOMEM;
> >
> > Question: eb is local to this thread, right, so eb->mm_fence is not 
> > considered "published" yet?
> >
> >> +
> >> +    ww_acquire_init(&acquire, &reservation_ww_class);
> >> +
> >> +    err = eb_lock_vma(eb, &acquire);
> >> +    if (err)
> >> +        goto out;
> >> +
> >> +    ww_acquire_done(&acquire);
> >> +
> >> +    list_for_each_entry(ev, &eb->lock, lock_link) {
> >> +        struct i915_vma *vma = ev->vma;
> >> +
> >> +        if (err == 0)
> >> +            err = eb_vma_get_pages(eb, ev, idx);
> >
> > I figure this is where you publish the proxy fence? If so, the fence 
> > signaling critical path starts with this loop, and that means any code 
> > we call between here and submission complete (including spawned work 
> > we need to wait for before submission) may not lock the 
> > reservation_ww_class nor (still being discussed) allocate memory.

Yes, at this point we have reserved the memory for the execbuf.

> > It 
> > looks like i915_pin_vma takes a reservation_ww_class. And all memory 
> > pinning seems to be in the fence critical path as well?

Correct, it's not meant to be waiting inside i915_vma_pin(); the
intention was to pass in memory, and then we would not need to
do the acquire ourselves. As we have just reserved the memory in the
above loop, this should not be an issue. I was trying to keep the
change minimal and allow incremental conversions. It does, however, need
to add a reference to the object for the work it spawns -- equally,
though, there is an async eviction pass later in execbuf. The challenge
here is that the greedy grab of bound vma is faster than doing the
unbound eviction handling (even when eviction is not required).
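
To illustrate the split being described (reserve everything up front,
then bind without allocating), here is a minimal userspace sketch. It is
only a model under stated assumptions: model_obj, model_reserve(),
model_bind() and the in_critical_section flag are hypothetical names for
illustration, not i915 code, and calloc() stands in for page allocation.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical model of an object; not a real i915 structure. */
struct model_obj {
	void *pages;	/* backing storage, NULL until reserved */
	bool bound;	/* set once the object has been "bound" */
};

/* Set while we are inside the modelled no-allocation phase. */
static bool in_critical_section;

/* All allocations go through here so that the rule can be checked. */
static void *model_alloc(size_t size)
{
	assert(!in_critical_section);
	return calloc(1, size);
}

/* Phase 1: may allocate and may fail; nothing is bound yet. */
static int model_reserve(struct model_obj *objs, size_t count, size_t size)
{
	for (size_t i = 0; i < count; i++) {
		if (objs[i].pages)
			continue;

		objs[i].pages = model_alloc(size);
		if (!objs[i].pages)
			return -1;
	}

	return 0;
}

/* Phase 2: only consumes what phase 1 reserved; never allocates. */
static int model_bind(struct model_obj *objs, size_t count)
{
	int err = 0;

	in_critical_section = true;
	for (size_t i = 0; i < count; i++) {
		if (!objs[i].pages) {
			err = -1; /* caller skipped the reservation pass */
			break;
		}
		objs[i].bound = true;
	}
	in_critical_section = false;

	return err;
}

/* Drop the reservation again. */
static void model_release(struct model_obj *objs, size_t count)
{
	for (size_t i = 0; i < count; i++) {
		free(objs[i].pages);
		objs[i].pages = NULL;
		objs[i].bound = false;
	}
}
```

Calling model_bind() before model_reserve() fails in this model, and any
allocation attempted during the bind phase would trip the assert, which
is the property the reservation pass is meant to provide.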

> And I think even if we at some point end up with the allocation 
> annotation the other way around, allowing memory allocations in fence 
> signalling critical paths, both relocations and userpointer would cause 
> lockdep problems because of
> 
> mmap_sem->reservation_object->fence_wait (fault handlers, lockdep priming)

We don't wait inside mmap_sem. One cannot: you do not know the locking
context, so you can only try to reclaim idle space. So you end up with
a multitude of threads each trying to claim the last slice of the
aperture/backing storage, unable to reclaim directly, and so having to
hit the equivalent of kswapd.
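
As a rough illustration of "only reclaim idle space, otherwise hand over
to a background reclaimer", here is a small userspace sketch. The names
model_slot, model_reclaim_idle() and model_claim() are hypothetical, and
a -1 return merely stands in for "go and poke the kswapd equivalent";
none of this is taken from the driver.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of one slice of aperture/backing storage. */
struct model_slot {
	bool in_use;	/* currently backs some object */
	bool busy;	/* still needed by outstanding work; never waited on */
};

/* Free every idle slot without waiting; returns how many were freed. */
static size_t model_reclaim_idle(struct model_slot *slots, size_t count)
{
	size_t freed = 0;

	for (size_t i = 0; i < count; i++) {
		if (slots[i].in_use && !slots[i].busy) {
			slots[i].in_use = false;
			freed++;
		}
	}

	return freed;
}

/*
 * Try to claim a slot. Returns its index, or -1 if nothing is free and
 * nothing idle could be reclaimed, in which case the caller has to
 * defer to a background reclaimer instead of waiting here.
 */
static int model_claim(struct model_slot *slots, size_t count)
{
	assert(slots != NULL);

	for (int pass = 0; pass < 2; pass++) {
		for (size_t i = 0; i < count; i++) {
			if (!slots[i].in_use) {
				slots[i].in_use = true;
				slots[i].busy = true;
				return (int)i;
			}
		}

		/* Second pass only makes sense if reclaim freed something. */
		if (pass == 0 && model_reclaim_idle(slots, count) == 0)
			break;
	}

	return -1;
}
```

With one busy and one idle slot, the first model_claim() reclaims the
idle slot and takes it; the next caller finds everything busy and gets
-1, which models the contention on the last slice.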

> vs
> fence_critical->gup/copy_from_user->mmap_sem

Which exists today; even the busy-wait loop is an implicit linkage. You
only need userspace to be holding a resource on the GPU to create the
deadlock. I've been using the userfault handler to develop test cases
where we can arbitrarily block the userptr.
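
For reference, the two orderings quoted above, taken at face value, form
a cycle of the kind lockdep reports. The following is a toy userspace
sketch of that check, not lockdep itself: the class names, the dep[]
matrix and add_dependency() are all hypothetical, and an edge A -> B
simply means "B is acquired or waited upon while A is held".

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical lock classes for the toy model. */
enum { MMAP_SEM, RESV_OBJ, FENCE, NR_CLASSES };

/* dep[a][b]: b has been taken (or waited on) while holding a. */
static bool dep[NR_CLASSES][NR_CLASSES];

/* Depth-first search: can we get from 'from' to 'to' along dep[]? */
static bool reaches(int from, int to, bool *seen)
{
	if (from == to)
		return true;
	if (seen[from])
		return false;
	seen[from] = true;

	for (int next = 0; next < NR_CLASSES; next++)
		if (dep[from][next] && reaches(next, to, seen))
			return true;

	return false;
}

/*
 * Record "taken while holding held". Returns false, recording nothing,
 * if the new edge would close a cycle (a potential deadlock).
 */
static bool add_dependency(int held, int taken)
{
	bool seen[NR_CLASSES] = { false };

	assert(held >= 0 && held < NR_CLASSES);
	assert(taken >= 0 && taken < NR_CLASSES);

	if (reaches(taken, held, seen))
		return false;

	dep[held][taken] = true;
	return true;
}
```

Feeding it mmap_sem -> reservation_object -> fence_wait and then
fence -> mmap_sem rejects the last edge, since mmap_sem already reaches
the fence through the reservation object.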
-Chris