On Tue, Feb 25, 2020 at 6:16 PM Daniel Vetter <daniel@xxxxxxxx> wrote:
>
> On Mon, Feb 24, 2020 at 07:46:59PM +0100, Christian König wrote:
> > On 23.02.20 at 17:54, Thomas Hellström (VMware) wrote:
> > > On 2/23/20 4:45 PM, Christian König wrote:
> > > > On 21.02.20 at 18:12, Daniel Vetter wrote:
> > > > > [SNIP]
> > > > > Yeah the Great Plan (tm) is to fully rely on ww_mutex slowly
> > > > > degenerating into essentially a global lock. But only when
> > > > > there's actual contention and thrashing.
> > > >
> > > > Yes, exactly. A really big problem in TTM currently is that we drop
> > > > the lock after evicting BOs because they tend to move in again
> > > > directly after that.
> > > >
> > > > From practice I can also confirm that there is exactly zero benefit
> > > > in dropping locks early and reacquiring them, for example for the VM
> > > > page tables. That just makes it more likely that somebody needs to
> > > > roll back, and that is what we need to avoid in the first place.
> > >
> > > If you have a benchmarking setup available it would be very interesting
> > > for future reference to see how changing from WD to WW mutexes affects
> > > the rollback frequency. WW is known to cause rollbacks much less
> > > frequently, but there is more work associated with each rollback.
> >
> > Not off hand. To be honest I still have a hard time getting a grip on
> > the difference between WD and WW from the algorithm point of view, so I
> > can't judge that difference at all.
> >
> > > > Contention on BO locks during command submission is perfectly fine
> > > > as long as it is as lightweight as possible while we don't have
> > > > thrashing. When we have thrashing, multi-submission performance is
> > > > best achieved by just letting a single process finish its business
> > > > and blocking everybody else.
> > >
> > > Hmm. Sounds like we need a per-manager ww_rwsem protecting manager
> > > allocation, taken in write mode when there's thrashing and in read
> > > mode otherwise. That would limit the number of "unnecessary" locks
> > > we'd have to keep and reduce unwanted side effects (see below):
> >
> > Well, per-manager (you mean per domain here, don't you?) doesn't sound
> > all that useful because we rarely use only one domain, but I've actually
> > been questioning for quite a while whether the per-BO lock scheme was
> > the right approach.
> >
> > See, from the performance aspect the closest to ideal solution I can
> > think of would be a ww_rwsem per user of a resource.
> >
> > In other words we don't lock BOs, but instead a list of all their users,
> > and when you want to evict a BO you need to walk that list and inform
> > all users that the BO will be moving.
> >
> > During command submission you then have the fast path which just grabs
> > the read side of the user lock and checks that all BOs are still in the
> > expected place.
> >
> > If some BOs were evicted you back off and start the slow path, e.g.
> > maybe even copy additional data from userspace, then grab the write
> > side of the lock, etc.
> >
> > That approach is similar to what we use in amdgpu with the per-VM BOs,
> > but goes a step further. Problem is that we are so used to per-BO locks
> > in the kernel that this is probably not doable any more.
>
> Yeah, I think it'd be nice to have the same approach for shared BOs too.
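
Just to spell out the fast path / slow path split described above, a rough
sketch might look like the code below. There is no ww_rwsem in the kernel
today, so a plain rw_semaphore stands in purely to show the control flow,
and all of the sketch_* names are invented:

/*
 * Rough sketch only: lock the user (VM) instead of the individual BOs.
 * A plain rw_semaphore stands in for the hypothetical ww_rwsem; all
 * sketch_* names are invented.
 */
#include <linux/errno.h>
#include <linux/rwsem.h>
#include <linux/types.h>

struct sketch_vm {
        struct rw_semaphore user_lock;  /* one lock per user of the resources */
};

struct sketch_bo {
        bool evicted;                   /* set by the eviction path */
};

/* Fast path: read-lock the single per-VM lock and check nothing moved. */
static int sketch_submit_fast(struct sketch_vm *vm,
                              struct sketch_bo **bos, unsigned int count)
{
        unsigned int i;

        down_read(&vm->user_lock);
        for (i = 0; i < count; i++) {
                if (bos[i]->evicted) {
                        up_read(&vm->user_lock);
                        return -EAGAIN; /* back off, take the slow path */
                }
        }
        /* ... queue the submission while holding the read side ... */
        up_read(&vm->user_lock);
        return 0;
}

/* Slow path: e.g. re-copy data from userspace first, then take the write
 * side, move the evicted BOs back in and submit. */
static int sketch_submit_slow(struct sketch_vm *vm,
                              struct sketch_bo **bos, unsigned int count)
{
        unsigned int i;

        down_write(&vm->user_lock);
        for (i = 0; i < count; i++)
                bos[i]->evicted = false;        /* stands in for revalidation */
        /* ... queue the submission ... */
        up_write(&vm->user_lock);
        return 0;
}

The point being that the fast path only ever touches one per-VM lock, no
matter how many BOs the submission references.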
> I guess what we could do is something like this (spinning your ww_rwsem
> idea a bit further):
>
> dma_buf_read_lock(buf, vm)
> {
>         if (IS_ENABLED(CONFIG_DEBUG_WW_MUTEX_SLOWPATH)) {
>                 /* check that vm is indeed listed in buf and splat if not */
>         }
>
>         /* for a buf that's not shared across multiple VMs we'd have
>          * buf->resv == vm->resv here */
>         return ww_mutex_lock(vm->resv);
> }
>
> dma_buf_write_lock(buf)
> {
>         for_each_vm_in_buf(buf, vm) {
>                 ww_mutex_lock(vm->resv);
>         }
> }
>
> Ideally we'd track all these VMs with something slightly less shoddy than
> a linked list :-) A resizeable array is probably pretty good; I think we
> only ever need to go from buf -> vm list, not the other way round, at
> least in dma_resv/dma_buf code. Driver code ofc needs to keep a list of
> all BOs bound to a VM somewhere, but that's probably a much bigger data
> structure for tracking vma offsets and mappings and other things on top.
>
> Ofc to even just get there we'd need something like the sublock list to
> keep track of all the additional locks we'd need for the writer lock. And
> we'd need the release callback for backoff, so that we could also go
> through the slowpath on a vm object that we're not holding a full
> reference on. That also means VMs need to be refcounted.
>
> And the list of VMs on a buffer needs to be protected with some lock and
> the usual kref_get_unless_zero trickery.
>
> But with all this I think we can make dma_buf_write_lock behave 100% like
> the old per-buffer lock for everyone, and execbuf could switch over to
> dma_buf_read_lock for shared buffers. Bonus points when the gpu context
> just keeps track of a list of the shared VMs used by buffers in that
> context ...
>
> That way we could make VM fastpath locking a la amdgpu opt-in, while
> keeping everyone else on the per-object locking juices.
>
> Thoughts?

One thing I just realized, which is nasty: the full (write) lock needs a
ww_acquire_ctx with this, because it needs to take a bunch of locks.
Rolling that out everywhere is going to be nasty.

I guess though we could do a fallback and have a locally created
ww_acquire_ctx if there's none passed in, with the backoff entirely
implemented within dma_resv_lock.
-Daniel

>
> Cheers, Daniel
>
> PS: Of course the write lock for these buffers is going to be terrible,
> so every time you need to update fences for implicit sync on a shared
> buffer (well, the write fence at least) it's going to suck. We probably
> also want a read_to_write_upgrade function, which can also be done easily
> with ww_mutex magic.
> --
> Daniel Vetter
> Software Engineer, Intel Corporation
> http://blog.ffwll.ch
--
Daniel Vetter
Software Engineer, Intel Corporation
+41 (0) 79 365 57 48 - http://blog.ffwll.ch
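
For reference, the reason the full write lock drags in a ww_acquire_ctx is
the usual ww_mutex acquire/backoff dance over multiple locks; a rough
sketch of that path is below. The sketch_* names and the fixed-size vms[]
array are invented stand-ins for whatever dma_buf would actually track,
only the ww_mutex/ww_acquire calls themselves are real API, and the ctx
handling glosses over exactly the lifetime question raised above:

/*
 * Rough sketch of the multi-lock write path; the sketch_* structures and
 * the fixed-size vms[] array are invented stand-ins, only the ww_mutex /
 * ww_acquire calls themselves are real API.
 */
#include <linux/errno.h>
#include <linux/ww_mutex.h>

static DEFINE_WW_CLASS(sketch_ww_class);

struct sketch_vm {
        struct ww_mutex resv;
};

struct sketch_buf {
        unsigned int vm_count;
        struct sketch_vm *vms[16];      /* stand-in for the buf -> vm tracking */
};

/*
 * ctx is caller-provided storage and has to stay alive until the matching
 * unlock plus ww_acquire_fini(); that lifetime requirement is part of what
 * makes a purely local fallback ctx awkward.
 */
static int sketch_write_lock(struct sketch_buf *buf, struct ww_acquire_ctx *ctx)
{
        unsigned int i, contended = buf->vm_count;      /* vm_count == "none" */
        int ret;

        ww_acquire_init(ctx, &sketch_ww_class);

retry:
        for (i = 0; i < buf->vm_count; i++) {
                if (i == contended)
                        continue;       /* already taken via the slowpath below */

                ret = ww_mutex_lock(&buf->vms[i]->resv, ctx);
                if (ret) {
                        unsigned int busy = i;

                        /* Back off: drop everything we hold so far ... */
                        while (i--) {
                                if (i != contended)
                                        ww_mutex_unlock(&buf->vms[i]->resv);
                        }
                        if (contended != buf->vm_count)
                                ww_mutex_unlock(&buf->vms[contended]->resv);

                        if (ret != -EDEADLK) {
                                ww_acquire_fini(ctx);
                                return ret;
                        }

                        /* ... then sleep on the contended lock and retry. */
                        ww_mutex_lock_slow(&buf->vms[busy]->resv, ctx);
                        contended = busy;
                        goto retry;
                }
        }

        ww_acquire_done(ctx);
        return 0;
}

A matching sketch_write_unlock() would drop every vms[i]->resv and only
then call ww_acquire_fini(ctx), which is why a ctx created locally inside
the lock function would still have to live somewhere until unlock time.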