Hi, Danilo

On Tue, 2023-10-03 at 18:55 +0200, Danilo Krummrich wrote:
> It seems like we're mostly aligned on this series, except for the key
> controversy we're discussing for a few versions now: locking of the
> internal lists. Hence, let's just re-iterate the options we have to
> get this out of the way.
>
> (1) The spinlock dance. This basically works for every use case,
>     updating the VA space from the IOCTL, from the fence signaling
>     path or anywhere else. However, it has the downside of requiring
>     spin_lock() / spin_unlock() for *each* list element when locking
>     all external objects and validating all evicted objects.
>     Typically, the amount of extobjs and evicted objects shouldn't be
>     excessive, but there might be exceptions, e.g. Xe.
>
> (2) The dma-resv lock dance. This is convenient for drivers updating
>     the VA space from a VM_BIND ioctl() and is especially efficient
>     if such drivers have a huge amount of external and/or evicted
>     objects to manage. However, the downsides are that it requires a
>     few tricks in drivers updating the VA space from the fence
>     signaling path (e.g. job_run()). Design-wise, I'm still skeptical
>     that it is a good idea to protect internal data structures with
>     external locks in a way where it's not clear to callers that a
>     certain function would access one of those resources and hence
>     needs protection. E.g. it is counter-intuitive that
>     drm_gpuvm_bo_put() would require both the dma-resv lock of the
>     corresponding object and the VM's dma-resv lock held.
>     (Additionally, there were some concerns from amdgpu regarding
>     flexibility in terms of using GPUVM for non-VM_BIND uAPIs and
>     compute, however, AFAICS those discussions did not complete and
>     to me it's still unclear why it wouldn't work.)
>
> (3) Simply use an internal mutex per list. This adds a tiny (IMHO
>     negligible) overhead for drivers updating the VA space from a
>     VM_BIND ioctl(), namely a *single* mutex_lock()/mutex_unlock()
>     when locking all external objects and validating all evicted
>     objects. And it still requires some tricks for drivers updating
>     the VA space from the fence signaling path. However, it's as
>     simple as it can be and hence way less error-prone, as well as
>     self-contained and hence easy to use. Additionally, it's flexible
>     in a way that we don't have any expectations on drivers to
>     already hold certain locks that the driver in some situation
>     might not be able to acquire in the first place.

Such an overhead is fully OK IMO, but didn't we conclude at some point
that using a mutex in this way isn't possible due to the fact that
validate() needs to be able to lock dma_resv, and then we have
dma_resv() -> mutex -> dma_resv()?
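I.e., spelled out (hypothetical sketch, not code from the series; the
names are made up and ww backoff handling is omitted):

/* Validation: VM resv -> internal mutex -> object resv. */
dma_resv_lock(vm_resv, &ctx);			/* dma_resv() ... */
mutex_lock(&gpuvm->evict_lock);			/* ... -> mutex ... */
list_for_each_entry(vm_bo, &gpuvm->evict_list, entry) {
	/*
	 * ... -> dma_resv(): validate() has to take the object's
	 * resv, a sleeping ww_mutex. On contention (-EDEADLK) the ww
	 * backoff would have to drop and re-acquire resvs while the
	 * mutex is still held, which is where this scheme breaks down.
	 */
	dma_resv_lock(vm_bo->obj->resv, &ctx);
	driver_validate(vm_bo->obj);		/* made-up callback */
	dma_resv_unlock(vm_bo->obj->resv);
}
mutex_unlock(&gpuvm->evict_lock);
dma_resv_unlock(vm_resv);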
> (4) Arbitrary combinations of the above. For instance, the current
>     V5 implements both (1) and (2) (as either one or the other). But
>     also (1) and (3) (as in (1) additionally to (3)) would be an
>     option, where a driver could opt-in for the spinlock dance in
>     case it updates the VA space from the fence signaling path.
>
> I also considered a few other options as well, however, they don't
> seem to be flexible enough. For instance, as by now we could use
> SRCU for the external object list. However, this falls apart once a
> driver wants to remove and re-add extobjs for the same VM_BO
> instance. (For the same reason it wouldn't work for evicted
> objects.)
>
> Personally, after seeing the weird implications of (1), (2) and a
> combination of both, I tend to go with (3). Optionally, with an
> opt-in for (1). The reason for the latter is that with (3) the
> weirdness of (1) by its own mostly disappears.
>
> Please let me know what you think, and, of course, other ideas than
> the mentioned ones above are still welcome.

Personally, after converting xe to version 5, I think it's pretty
convenient for the driver (although I had to add the evict trick), so
I think I'd vote for this, even if not currently using the opt-in for
(1).

/Thomas

> - Danilo
>
> On Tue, Oct 03, 2023 at 04:21:43PM +0200, Boris Brezillon wrote:
> > On Tue, 03 Oct 2023 14:25:56 +0200
> > Thomas Hellström <thomas.hellstrom@xxxxxxxxxxxxxxx> wrote:
> >
> > > > > > +/**
> > > > > > + * get_next_vm_bo_from_list() - get the next vm_bo element
> > > > > > + * @__gpuvm: The GPU VM
> > > > > > + * @__list_name: The name of the list we're iterating on
> > > > > > + * @__local_list: A pointer to the local list used to
> > > > > > + *	store already iterated items
> > > > > > + * @__prev_vm_bo: The previous element we got from
> > > > > > + *	drm_gpuvm_get_next_cached_vm_bo()
> > > > > > + *
> > > > > > + * This helper is here to provide lockless list iteration.
> > > > > > + * Lockless as in, the iterator releases the lock
> > > > > > + * immediately after picking the first element from the
> > > > > > + * list, so list insertion/deletion can happen
> > > > > > + * concurrently.
> > > > > > + *
> > > > > > + * Elements popped from the original list are kept in a
> > > > > > + * local list, so removal and is_empty checks can still
> > > > > > + * happen while we're iterating the list.
> > > > > > + */
> > > > > > +#define get_next_vm_bo_from_list(__gpuvm, __list_name, __local_list, __prev_vm_bo)	\
> > > > > > +	({									\
> > > > > > +		struct drm_gpuvm_bo *__vm_bo = NULL;				\
> > > > > > +										\
> > > > > > +		drm_gpuvm_bo_put(__prev_vm_bo);					\
> > > > > > +										\
> > > > > > +		spin_lock(&(__gpuvm)->__list_name.lock);			\
> > > > >
> > > > > Here we unconditionally take the spinlocks while iterating,
> > > > > and the main point of DRM_GPUVM_RESV_PROTECTED was really to
> > > > > avoid that?
> > > > >
> > > > > > +		if (!(__gpuvm)->__list_name.local_list)				\
> > > > > > +			(__gpuvm)->__list_name.local_list = __local_list;	\
> > > > > > +		else								\
> > > > > > +			WARN_ON((__gpuvm)->__list_name.local_list != __local_list);	\
> > > > > > +										\
> > > > > > +		while (!list_empty(&(__gpuvm)->__list_name.list)) {		\
> > > > > > +			__vm_bo = list_first_entry(&(__gpuvm)->__list_name.list,	\
> > > > > > +						   struct drm_gpuvm_bo,		\
> > > > > > +						   list.entry.__list_name);	\
> > > > > > +			if (kref_get_unless_zero(&__vm_bo->kref)) {		\
> > > > >
> > > > > And unnecessarily grab a reference in the RESV_PROTECTED
> > > > > case.
> > > > >
> > > > > > +				list_move_tail(&(__vm_bo)->list.entry.__list_name,	\
> > > > > > +					       __local_list);			\
> > > > > > +				break;						\
> > > > > > +			} else {						\
> > > > > > +				list_del_init(&(__vm_bo)->list.entry.__list_name);	\
> > > > > > +				__vm_bo = NULL;					\
> > > > > > +			}							\
> > > > > > +		}								\
> > > > > > +		spin_unlock(&(__gpuvm)->__list_name.lock);			\
> > > > > > +										\
> > > > > > +		__vm_bo;							\
> > > > > > +	})
> > > > >
> > > > > IMHO this lockless list iteration looks very complex and
> > > > > should be pretty difficult to maintain while moving forward;
> > > > > also, since it pulls the gpuvm_bos off the list, list
> > > > > iteration needs to be protected by an outer lock anyway.
> > > >
> > > > As being partly responsible for this convoluted list iterator,
> > > > I must say I agree with you. There are so many ways this can go
> > > > wrong if the user doesn't call it the right way, or doesn't
> > > > protect concurrent list iterations with a separate lock
> > > > (luckily, this is a private iterator). I mean, it works, so
> > > > there's certainly a way to get it right, but gosh, this is so
> > > > far from the simple API I had hoped for.
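For illustration, driving the iterator quoted above from, say, a
validate loop ends up looking roughly like this (sketch only;
driver_validate() is made up, and the restore_vm_bo_list() splice-back
helper from the same patch is quoted from memory):

struct drm_gpuvm_bo *vm_bo = NULL;
LIST_HEAD(my_evict_list);	/* local list for iterated items */
int ret = 0;

while ((vm_bo = get_next_vm_bo_from_list(gpuvm, evict,
					 &my_evict_list, vm_bo))) {
	ret = driver_validate(vm_bo);	/* made-up callback */
	if (ret) {
		/* On early exit we still own the reference the
		 * iterator took on our behalf.
		 */
		drm_gpuvm_bo_put(vm_bo);
		break;
	}
}

/* Move everything back to the real list before anyone relies on it. */
restore_vm_bo_list(gpuvm, evict, &my_evict_list);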
> > > > > Also from what I understand from Boris, the extobj list
> > > > > would typically not need the fine-grained locking; only the
> > > > > evict list?
> > > >
> > > > Right, I'm adding the gpuvm_bo to the extobj list in the ioctl
> > > > path, when the GEM and VM resvs are held, and I'm deferring the
> > > > drm_gpuvm_bo_put() call to a work that's not in the
> > > > dma-signalling path. This being said, I'm still not comfortable
> > > > with the
> > > >
> > > >	gem = drm_gem_object_get(vm_bo->gem);
> > > >	dma_resv_lock(gem->resv);
> > > >	drm_gpuvm_bo_put(vm_bo);
> > > >	dma_resv_unlock(gem->resv);
> > > >	drm_gem_object_put(gem);
> > > >
> > > > dance that's needed to avoid a UAF when the gpuvm_bo is the
> > > > last GEM owner, not to mention that drm_gpuva_unlink() calls
> > > > drm_gpuvm_bo_put() after making sure the GEM gpuvm_list lock is
> > > > held, but this lock might differ from the resv lock (custom
> > > > locking so we can call gpuvm_unlink() in the dma-signalling
> > > > path). So we now have paths where drm_gpuvm_bo_put() is called
> > > > with the resv lock held, and others where it is not, and that
> > > > only works because we're relying on the fact those
> > > > drm_gpuvm_bo_put() calls won't make the refcount drop to zero,
> > > > because the deferred vm_bo_put() work still owns a vm_bo ref.
> > >
> > > I'm not sure I follow to 100% here, but in the code snippet above
> > > it's pretty clear to me that it needs to hold an explicit gem
> > > object reference when calling dma_resv_unlock(gem->resv). Each
> > > time you copy a referenced pointer (here from vm_bo->gem to gem)
> > > you need to up the refcount unless you make sure (by locks or
> > > other means) that the source of the copy has a strong refcount
> > > and stays alive, so that's no weird action to me. Could possibly
> > > add a drm_gpuvm_bo_get_gem() to access the gem member (and that
> > > also takes a refcount) for driver users to avoid the potential
> > > pitfall.
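I.e., something like this (sketch only, such a helper doesn't exist
in the series yet; GEM member name as in the snippet above):

/* Return the vm_bo's GEM object with a reference taken, so the
 * caller can keep dereferencing it (e.g. to unlock its resv) after
 * the final drm_gpuvm_bo_put().
 */
static inline struct drm_gem_object *
drm_gpuvm_bo_get_gem(struct drm_gpuvm_bo *vm_bo)
{
	struct drm_gem_object *gem = vm_bo->gem;

	drm_gem_object_get(gem);
	return gem;
}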
> > Except this is only needed because of the GEM-resv-must-be-held
> > locking constraint that was added on vm_bo_put(). I mean, the
> > usual way we do object un-referencing is by calling _put() and
> > letting the internal logic undo things when the refcount drops to
> > zero. If the object needs to be removed from some list, it's
> > normally the responsibility of the destruction method to lock the
> > list, remove the object and unlock the list. Now, we have a
> > refcounted object that's referenced by vm_bo, and whose lock needs
> > to be taken when the destruction happens, which leads to this
> > weird dance described above, when, in normal situations, we'd just
> > call drm_gpuvm_bo_put(vm_bo) and let drm_gpuvm do its thing.
> >
> > > > All these tiny details add to the overall complexity of this
> > > > common layer, and to me, that's not any better than the
> > > > get_next_vm_bo_from_list() complexity you were complaining
> > > > about (might be even worse, because this sort of thing leaks
> > > > to users).
> > > >
> > > > Having an internal lock partly solves that, in that the
> > > > locking of the extobj list is now entirely orthogonal to the
> > > > GEM that's being removed from this list, and we can
> > > > lock/unlock internally without forcing the caller to take
> > > > weird actions to make sure things don't explode. Don't get me
> > > > wrong, I get that this locking overhead is not acceptable for
> > > > Xe, but I feel like we're turning drm_gpuvm into a white
> > > > elephant that only few people will get right.
> > >
> > > I tend to agree, but to me the big complication comes from the
> > > async (dma signalling path) state updates.
> >
> > I don't deny updating the VM state from the dma signalling path
> > adds some amount of complexity, but the fact we're trying to use
> > dma_resv locks for everything, including protection of internal
> > datasets, doesn't help. Anyway, I think both of us are biased when
> > it comes to judging which approach adds the most complexity :P.
> >
> > Also note that, right now, the only thing I'd like to be able to
> > update from the dma signalling path is the VM mapping tree.
> > Everything else (drm_gpuva_[un]link(), add/remove extobj), we
> > could do outside this path:
> >
> > - for MAP operations, we could call drm_gpuva_link() in the ioctl
> >   path (we'd just need to initialize the drm_gpuva object)
> > - for MAP operations, we're already calling drm_gpuvm_bo_obtain()
> >   from the ioctl path
> > - for UNMAP operations, we could add the drm_gpuva_unlink() call
> >   to the VM op cleanup worker
> >
> > The only problem we'd have is that drm_gpuva_link() needs to be
> > called inside drm_gpuvm_ops::sm_step_remap() when a remap with
> > next/prev != NULL occurs, otherwise we lose track of these
> > mappings.
> >
> > > Let's say for example we have a lower level lock for the gem
> > > object's gpuvm_bo list. Some drivers grab it from the dma fence
> > > signalling path, other drivers need to access all vm's of a bo
> > > to grab their dma_resv locks using a WW transaction. There will
> > > be problems, although probably solvable.
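Roughly this situation (hypothetical sketch, all names made up,
-EDEADLK backoff omitted):

struct ww_acquire_ctx ctx;
struct drm_gpuvm_bo *vm_bo;

/* Driver A removes a vm_bo from the fence signalling path, so the
 * per-GEM gpuvm_bo list lock must not be a sleeping lock:
 */
spin_lock(&gem->gpuvm_bo_lock);			/* made-up lock */
list_del(&vm_bo->list.entry.gem);
spin_unlock(&gem->gpuvm_bo_lock);

/* Driver B wants the resv of every VM mapping this bo in one WW
 * transaction, but dma_resv_lock() sleeps (and may need to back
 * off), so it can't run under that same spinlock protecting the
 * list walk:
 */
ww_acquire_init(&ctx, &reservation_ww_class);
list_for_each_entry(vm_bo, &gem->gpuvm_bo_list, list.entry.gem)
	dma_resv_lock(vm_bo->vm->resv, &ctx);
ww_acquire_done(&ctx);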
> > To me, the gpuvm extobj vm_bo list is just an internal list and
> > has an internal lock associated. The lock that's protecting the
> > GEM vm_bo list is a bit different in that the driver gets to
> > decide when a vm_bo is inserted/removed by calling
> > drm_gpuvm_[un]link(), and can easily make sure the lock is held
> > when this happens, while the gpuvm internal lists are kinda
> > transparently updated (for instance, the first caller of
> > drm_gpuvm_bo_obtain() adds the vm_bo to the extobj and the last
> > vm_bo owner calling drm_gpuvm_bo_put() removes it from this list,
> > which is certainly not obvious based on the name of these
> > functions).
> >
> > If we want to let drivers iterate over the extobj/evict lists, and
> > assuming they are considered internal lists maintained by the core
> > and protected with an internal lock, we should indeed provide
> > iteration helpers that:
> >
> > 1/ make sure all the necessary external locks are held (VM resv, I
> >    guess)
> > 2/ make sure the internal lock is not held during iteration (the
> >    sort of snapshot list trick you're using for the evict list in
> >    Xe)
> >
> > > > > Also it seems that if we are to maintain two modes here, for
> > > > > reasonably clean code we'd need two separate instances of
> > > > > get_next_bo_from_list().
> > > > >
> > > > > For the !RESV_PROTECTED case, perhaps one would want to
> > > > > consider the solution used currently in xe, where the VM
> > > > > maintains two evict lists. One protected by a spinlock and
> > > > > one protected by the VM resv. When the VM resv is locked to
> > > > > begin list traversal, the spinlock is locked *once* and the
> > > > > spinlock-protected list is looped over and copied into the
> > > > > resv protected one. For traversal, the resv protected one is
> > > > > used.
> > > >
> > > > Oh, so you do have the same sort of trick where you move the
> > > > entire list to another list, such that you can let other paths
> > > > update the list while you're iterating your own snapshot.
> > > > That's interesting...
> > >
> > > Yes, it's instead of the "evicted" bool suggested here. I
> > > thought the latter would be simpler. Although that remains to be
> > > seen after all use-cases are implemented.
> > >
> > > But in general I think the concept of copying from a staging
> > > list to another with different protection rather than traversing
> > > the first list and unlocking between items is a good way of
> > > solving the locking inversion problem with minimal overhead. We
> > > use it also for O(1) userptr validation.
> >
> > That's more or less the idea behind get_next_vm_bo_from_list()
> > except it's dequeuing one element at a time, instead of moving all
> > items at once. Note that, if you allow concurrent removal
> > protected only by the spinlock, you still need to take/release
> > this spinlock when iterating over elements of this snapshot list,
> > because all the remover needs to remove an element is the element
> > itself, and it doesn't care in which list it's currently inserted
> > (real or snapshot/staging list), so you'd be iterating over a
> > moving target if you don't protect the iteration with the
> > spinlock.
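For reference, the xe-style two-list trick boils down to something
like this (sketch with made-up names; xe's actual naming differs):

struct my_vm {
	spinlock_t	 evict_lock;
	struct list_head evict_staging;	/* protected by evict_lock;
					 * appended to from the fence
					 * signalling path
					 */
	struct list_head evicted;	/* protected by the VM resv */
};

/* Called with the VM resv held, before traversal starts: one
 * lock/unlock for the whole batch instead of one per element.
 */
static void my_vm_collect_evicted(struct my_vm *vm)
{
	spin_lock(&vm->evict_lock);
	list_splice_tail_init(&vm->evict_staging, &vm->evicted);
	spin_unlock(&vm->evict_lock);
}

Traversal then walks vm->evicted under the VM resv, and, per Boris's
point above, if removers are still allowed to pull elements out under
only the spinlock, the walk of the snapshot needs the spinlock too.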