Re: [PATCH 7/7] drm/i915/gem: Acquire all vma/objects under reservation_ww_class

Chris Wilson <chris@xxxxxxxxxxxxxxxxxx> · Tue, 23 Jun 2020 17:00:31 +0100

Quoting Thomas Hellström (Intel) (2020-06-23 16:09:08)
> 
> On 6/23/20 4:01 PM, Chris Wilson wrote:
> > Quoting Thomas Hellström (Intel) (2020-06-23 13:57:06)
> >> On 6/23/20 1:22 PM, Thomas Hellström (Intel) wrote:
> >>> Hi, Chris,
> >>>
> >>> On 6/22/20 11:59 AM, Chris Wilson wrote:
> >>>> In order to actually handle eviction and what not, we need to process
> >>>> all the objects together under a common lock, reservation_ww_class. As
> >>>> such, do a memory reservation pass after looking up the object/vma,
> >>>> which then feeds into the rest of execbuf [relocation, cmdparsing,
> >>>> flushing and ofc execution].
> >>>>
> >>>> Signed-off-by: Chris Wilson <chris@xxxxxxxxxxxxxxxxxx>
> >>>> ---
> >>>>    .../gpu/drm/i915/gem/i915_gem_execbuffer.c    | 91 ++++++++++++++-----
> >>>>    1 file changed, 70 insertions(+), 21 deletions(-)
> >>>>
> >>>> diff --git a/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c
> >>>> b/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c
> >>>> index 46fcbdf8161c..8db2e013465f 100644
> >>>> --- a/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c
> >>>> +++ b/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c
> >>>> @@ -53,10 +53,9 @@ struct eb_vma_array {
> >>>>      #define __EXEC_OBJECT_HAS_PIN        BIT(31)
> >>>>    #define __EXEC_OBJECT_HAS_FENCE        BIT(30)
> >>>> -#define __EXEC_OBJECT_HAS_PAGES        BIT(29)
> >>>> -#define __EXEC_OBJECT_NEEDS_MAP        BIT(28)
> >>>> -#define __EXEC_OBJECT_NEEDS_BIAS    BIT(27)
> >>>> -#define __EXEC_OBJECT_INTERNAL_FLAGS    (~0u << 27) /* all of the
> >>>> above */
> >>>> +#define __EXEC_OBJECT_NEEDS_MAP        BIT(29)
> >>>> +#define __EXEC_OBJECT_NEEDS_BIAS    BIT(28)
> >>>> +#define __EXEC_OBJECT_INTERNAL_FLAGS    (~0u << 28) /* all of the
> >>>> above */
> >>>>      #define __EXEC_HAS_RELOC    BIT(31)
> >>>>    #define __EXEC_INTERNAL_FLAGS    (~0u << 31)
> >>>> @@ -241,6 +240,8 @@ struct i915_execbuffer {
> >>>>        struct intel_context *context; /* logical state for the request */
> >>>>        struct i915_gem_context *gem_context; /** caller's context */
> >>>>    +    struct dma_fence *mm_fence;
> >>>> +
> >>>>        struct i915_request *request; /** our request to build */
> >>>>        struct eb_vma *batch; /** identity of the batch obj/vma */
> >>>>        struct i915_vma *trampoline; /** trampoline used for chaining */
> >>>> @@ -331,12 +332,7 @@ static inline void eb_unreserve_vma(struct
> >>>> eb_vma *ev)
> >>>>        if (ev->flags & __EXEC_OBJECT_HAS_PIN)
> >>>>            __i915_vma_unpin(vma);
> >>>>    -    if (ev->flags & __EXEC_OBJECT_HAS_PAGES)
> >>>> -        i915_gem_object_unpin_pages(vma->obj);
> >>>> -
> >>>> -    ev->flags &= ~(__EXEC_OBJECT_HAS_PIN |
> >>>> -               __EXEC_OBJECT_HAS_FENCE |
> >>>> -               __EXEC_OBJECT_HAS_PAGES);
> >>>> +    ev->flags &= ~(__EXEC_OBJECT_HAS_PIN | __EXEC_OBJECT_HAS_FENCE);
> >>>>    }
> >>>>      static void eb_vma_array_destroy(struct kref *kref)
> >>>> @@ -667,6 +663,55 @@ eb_add_vma(struct i915_execbuffer *eb,
> >>>>        list_add_tail(&ev->lock_link, &eb->lock);
> >>>>    }
> >>>>    +static int eb_vma_get_pages(struct i915_execbuffer *eb,
> >>>> +                struct eb_vma *ev,
> >>>> +                u64 idx)
> >>>> +{
> >>>> +    struct i915_vma *vma = ev->vma;
> >>>> +    int err;
> >>>> +
> >>>> +    /* XXX also preallocate PD for vma */
> >>>> +
> >>>> +    err = ____i915_gem_object_get_pages_async(vma->obj);
> >>>> +    if (err)
> >>>> +        return err;
> >>>> +
> >>>> +    return i915_active_ref(&vma->obj->mm.active, idx, eb->mm_fence);
> >>>> +}
> >>>> +
> >>>> +static int eb_reserve_mm(struct i915_execbuffer *eb)
> >>>> +{
> >>>> +    const u64 idx = eb->context->timeline->fence_context;
> >>>> +    struct ww_acquire_ctx acquire;
> >>>> +    struct eb_vma *ev;
> >>>> +    int err;
> >>>> +
> >>>> +    eb->mm_fence = __dma_fence_create_proxy(0, 0);
> >>>> +    if (!eb->mm_fence)
> >>>> +        return -ENOMEM;
> >>> Question: eb is local to this thread, right, so eb->mm_fence is not
> >>> considered "published" yet?
> >>>
> >>>> +
> >>>> +    ww_acquire_init(&acquire, &reservation_ww_class);
> >>>> +
> >>>> +    err = eb_lock_vma(eb, &acquire);
> >>>> +    if (err)
> >>>> +        goto out;
> >>>> +
> >>>> +    ww_acquire_done(&acquire);
> >>>> +
> >>>> +    list_for_each_entry(ev, &eb->lock, lock_link) {
> >>>> +        struct i915_vma *vma = ev->vma;
> >>>> +
> >>>> +        if (err == 0)
> >>>> +            err = eb_vma_get_pages(eb, ev, idx);
> >>> I figure this is where you publish the proxy fence? If so, the fence
> >>> signaling critical path starts with this loop, and that means any code
> >>> we call between here and submission complete (including spawned work
> >>> we need to wait for before submission) may not lock the
> >>> reservation_ww_class nor (still being discussed) allocate memory.
> > Yes, at this point we have reserved the memory for the execbuf.
> >
> >>> It
> >>> looks like i915_pin_vma takes a reservation_ww_class. And all memory
> >>> pinning seems to be in the fence critical path as well?
> > Correct, it's not meant to be waiting inside i915_vma_pin(); the
> > intention was to pass in memory, and then we would not need to
> > do the acquire ourselves. As we have just reserved the memory in the
> > above loop, this should not be an issue. I was trying to keep the
> > change minimal and allow incremental conversions. It does however need
> > to add a reference to the object for the work it spawns -- equally
> > though there is an async eviction pass later in execbuf. The challenge
> > here is that the greedy grab of bound vma is faster than doing the
> > unbound eviction handling (even when eviction is not required).
> 
> So for the i915_vma_pin, it looks like
> 
> fence_critical_start(eb_reserve_mm) -> 
> dma_resv_lock_interruptible(i915_vma_pin) -> lockdep issue.
> 
> You can't take the dma_resv_lock inside a fence critical section.

Aye, and that is trivially liftable since the allocation is provided by
the caller. But we still want one off access, hence the preference for
keeping the convenience of i915_vma_pin until all callers have
transitioned.

> And for the memory allocation, it looks like the fence is published in 
> the first loop iteration, starting the critical section, meaning that 
> any memory allocation that follows will cause a lockdep issue. That 
> includes worker threads. (with the proposed dma_fence annotations).

I fail to be convinced that proposal is a good solution.

> >> And I think even if we at some point end up with the allocation
> >> annotation the other way around, allowing memory allocations in fence
> >> signalling critical paths, both relocations and userpointer would cause
> >> lockdep problems because of
> >>
> >> mmap_sem->reservation_object->fence_wait (fault handlers, lockdep priming)
> > We don't wait inside mmap_sem. One cannot, you do not know the locking
> > context, so you can only try to reclaim idle space. So you end up with
> > the issue of a multitude of threads each trying to claim the last slice
> > of the aperture/backing storage, not being able to directly reclaim and
> > so have to hit the equivalent of kswapd.
> 
> I don't think I follow you here. There are a number of drivers that wait 
> for dma_fences inside the fault handlers with mmap_sem held for data to 
> be migrated before the pte is set up.

I mean that userspace is at liberty to prevent the migration
arbitrarily, forming a resource lock. Waiting for userspace while holding
any mutex is an easy deadlock. Hence why any such wait for eviction of another
must be interruptible (either by signal, timeout), ensure force completion or
never actually wait. A wait for itself should at least be killable.

> >> vs
> >> fence_critical->gup/copy_from_user->mmap_sem
> > Which exists today, even the busy wait loop is implicit linkage; you only
> > need userspace to be holding a resource on the gpu to create the deadlock.
> > I've been using the userfault handler to develop test cases where we can
> > arbitrarily block the userptr.
> 
> Yes but in a case where we don't publish the fence early, the above 
> would be reduced to the well known reservation_ww_class vs mmap_sem 
> lockdep issue, which other drivers seem to have solved and we could copy 
> what they've done.

I don't see the difference. We have gup, malloc, copy_user inside a fence
as it stands. What I would want to address, with say, ttm_bo_vm_reserve is
that faulting should only care about the migration chain, and we should
be careful when selecting eviction candidates (if only because we need
to respect memory prioritisation).
-Chris
_______________________________________________
Intel-gfx mailing list
Intel-gfx@xxxxxxxxxxxxxxxxxxxxx
https://lists.freedesktop.org/mailman/listinfo/intel-gfx