Re: [PATCH 5/5] drm/amdgpu: implement amdgpu_gem_prime_move_notify v2

Daniel Vetter <daniel@xxxxxxxx> · Thu, 27 Feb 2020 10:38:43 +0100

On Thu, Feb 27, 2020 at 10:20 AM Christian König
<christian.koenig@xxxxxxx> wrote:
> Am 26.02.20 um 17:32 schrieb Daniel Vetter:
> > On Tue, Feb 25, 2020 at 6:16 PM Daniel Vetter <daniel@xxxxxxxx> wrote:
> >> On Mon, Feb 24, 2020 at 07:46:59PM +0100, Christian König wrote:
> >>> Am 23.02.20 um 17:54 schrieb Thomas Hellström (VMware):
> >>>> On 2/23/20 4:45 PM, Christian König wrote:
> >>>>> Am 21.02.20 um 18:12 schrieb Daniel Vetter:
> >>>>>> [SNIP]
> >>>>>> Yeah the Great Plan (tm) is to fully rely on ww_mutex slowly
> >>>>>> degenerating
> >>>>>> into essentially a global lock. But only when there's actual contention
> >>>>>> and thrashing.
> >>>>> Yes exactly. A really big problem in TTM is currently that we drop
> >>>>> the lock after evicting BOs because they tend to move in again
> >>>>> directly after that.
> >>>>>
> >>>>>  From practice I can also confirm that there is exactly zero benefit
> >>>>> from dropping locks early and reacquire them for example for the VM
> >>>>> page tables. That's just makes it more likely that somebody needs to
> >>>>> roll back and this is what we need to avoid in the first place.
> >>>> If you have a benchmarking setup available it would be very interesting
> >>>> for future reference to see how changing from WD to WW mutexes affects
> >>>> the roll back frequency. WW is known to cause rollbacks much less
> >>>> frequently but there is more work associated with each rollback.
> >>> Not of hand. To be honest I still have a hard time to get a grip on the
> >>> difference between WD and WW from the algorithm point of view. So I can't
> >>> judge that difference at all.
> >>>
> >>>>> Contention on BO locks during command submission is perfectly fine
> >>>>> as long as this is as lightweight as possible while we don't have
> >>>>> trashing. When we have trashing multi submission performance is best
> >>>>> archived to just favor a single process to finish its business and
> >>>>> block everybody else.
> >>>> Hmm. Sounds like we need a per-manager ww_rwsem protecting manager
> >>>> allocation, taken in write-mode then there's thrashing. In read-mode
> >>>> otherwise. That would limit the amount of "unnecessary" locks we'd have
> >>>> to keep and reduce unwanted side-effects, (see below):
> >>> Well per-manager (you mean per domain here don't you?) doesn't sound like
> >>> that useful because we rarely use only one domain, but I'm actually
> >>> questioning for quite a while if the per BO lock scheme was the right
> >>> approach.
> >>>
> >>> See from the performance aspect the closest to ideal solution I can think of
> >>> would be a ww_rwsem per user of a resource.
> >>>
> >>> In other words we don't lock BOs, but instead a list of all their users and
> >>> when you want to evict a BO you need to walk that list and inform all users
> >>> that the BO will be moving.
> >>>
> >>> During command submission you then have the fast path which rather just
> >>> grabs the read side of the user lock and check if all BOs are still in the
> >>> expected place.
> >>>
> >>> If some BOs were evicted you back off and start the slow path, e.g. maybe
> >>> even copy additional data from userspace then grab the write side of the
> >>> lock etc.. etc...
> >>>
> >>> That approach is similar to what we use in amdgpu with the per-VM BOs, but
> >>> goes a step further. Problem is that we are so used to per BO locks in the
> >>> kernel that this is probably not doable any more.
> >> Yeah I think it'd be nice to have the same approach for shared bo too. I
> >> guess what we could do is something like this (spinning your ww_rwmutex
> >> idea a bit further):
> >>
> >> dma_buf_read_lock(buf, vm)
> >> {
> >>          if (enabled(CONFIG_DEBUG_WW_MUTEX_SLOWPATH))
> >>          {
> >>                  check that vm is indeed listed in buf and splat if not
> >>          }
> >>
> >>          /* for a buf that's not shared in multiple vm we'd have buf->resv
> >>           * == vm->resv here */
> >>          return ww_mutex_lock(vm->resv);
> >> }
> >>
> >> dma_buf_write_lock(buf)
> >> {
> >>          for_each_vm_in_buf(buf, vm) {
> >>                  ww_mutex_lock(vm->resv);
> >>          }
> >> }
> >>
> >> Ideally we'd track all these vms with something slightly less shoddy than
> >> a linked list :-) Resizeable array is probably pretty good, I think we
> >> only ever need to go from buf -> vm list, not the other way round. At
> >> least in dma_resv/dma_buf code, driver code ofc needs to keep a list of
> >> all bo bound to a vm somewhere. But that's probably a much bigger
> >> datastructure for tracking vma offsets and mappings and other things on
> >> top.
> >>
> >> Ofc to even just get there we'd need something like the sublock list to
> >> keep track of all the additional locks we'd need for the writer lock. And
> >> we'd need the release callback for backoff, so that we could also go
> >> through the slowpath on a vm object that we're not holding a full
> >> reference on. That also means vm need to be refcounted.
> >>
> >> And the list of vms on a buffer need to be protected with some lock and
> >> the usual kref_get_unless_zero trickery.
> >>
> >> But with all this I think we can make the dma_buf_write_lock lock 100%
> >> like the old per-buffer lock for everyone. And execbuf could switch over
> >> to dma_buf_read_lock for shared buffers. Bonus points when the gpu context
> >> just keeps track of a list of shared vm used by buffers in that context
> >> ...
> >>
> >> That way we could make vm fastpath locking a la amdgpu opt-in, while
> >> keeping everyone else on the per-object locking juices.
> >>
> >> Thoughts?
>
> At least to me that sounds like a plan.
>
> > One thing I just realized, which is nasty: The full (write) lock needs
> > ww_acquire_ctx with this, because it needs to take a bunch of locks.
> > Rolling that out everywhere is going to be nasty.
>
> Why? Take a single write lock shouldn't be different to taking a single
> ww_mutex, or am I missing something?

There's no single write lock in my proposal. The write lock for a
buffer is "take all the vm locks this buffer is mapped into". Which
means we need to take multiple ww_mutex locks, which means backoff and
everything.

Otherwise the read lock of just taking the per vm dma_resv lock won't
work. And doing an true rw lock on each buffer feels a bit pointless,
since then execbuf is back to O(num_buffers). And that's the suck
we're trying to avoid.

> > I guess though we could do a fallback and have a locally created
> > ww_acquire_ctx if there's none passed in, with backoff entirely
> > implemented within dma_resv_lock.
>
> How should that work? As far as I understand it the ww_acquire_ctx must
> be kept existing until after the last of the locks it was used with is
> unlocked. Or do I see this incorrectly?

Ugh right.

I guess we could put the ww_acquire_ctx into the dma_resv so it
survives until the unlock. But that's a rather gross hack.
-Daniel

> > -Daniel
> >
> >> Cheers, Daniel
> >>
> >> PS: Of course the write lock for these buffers is going to be terrible, so
> >> every time you need to update fences for implicit sync on shared buffer
> >> (well write fence at least) it's going to suck. We probably also want a
> >> read_to_write_upgrade function, which also can be done easily with
> >> ww_mutex magic.
>
> I'm thinking that we probably sole want a read_to_write upgrade function.
>
> Regards,
> Christian.
>
> >> --
> >> Daniel Vetter
> >> Software Engineer, Intel Corporation
> >>
>

-- 
Daniel Vetter
Software Engineer, Intel Corporation
+41 (0) 79 365 57 48 - http://blog.ffwll.ch