On Thu, Feb 27, 2020 at 10:20 AM Christian König <christian.koenig@xxxxxxx> wrote: > Am 26.02.20 um 17:32 schrieb Daniel Vetter: > > On Tue, Feb 25, 2020 at 6:16 PM Daniel Vetter <daniel@xxxxxxxx> wrote: > >> On Mon, Feb 24, 2020 at 07:46:59PM +0100, Christian König wrote: > >>> Am 23.02.20 um 17:54 schrieb Thomas Hellström (VMware): > >>>> On 2/23/20 4:45 PM, Christian König wrote: > >>>>> Am 21.02.20 um 18:12 schrieb Daniel Vetter: > >>>>>> [SNIP] > >>>>>> Yeah the Great Plan (tm) is to fully rely on ww_mutex slowly > >>>>>> degenerating > >>>>>> into essentially a global lock. But only when there's actual contention > >>>>>> and thrashing. > >>>>> Yes exactly. A really big problem in TTM is currently that we drop > >>>>> the lock after evicting BOs because they tend to move in again > >>>>> directly after that. > >>>>> > >>>>> From practice I can also confirm that there is exactly zero benefit > >>>>> from dropping locks early and reacquire them for example for the VM > >>>>> page tables. That's just makes it more likely that somebody needs to > >>>>> roll back and this is what we need to avoid in the first place. > >>>> If you have a benchmarking setup available it would be very interesting > >>>> for future reference to see how changing from WD to WW mutexes affects > >>>> the roll back frequency. WW is known to cause rollbacks much less > >>>> frequently but there is more work associated with each rollback. > >>> Not of hand. To be honest I still have a hard time to get a grip on the > >>> difference between WD and WW from the algorithm point of view. So I can't > >>> judge that difference at all. > >>> > >>>>> Contention on BO locks during command submission is perfectly fine > >>>>> as long as this is as lightweight as possible while we don't have > >>>>> trashing. When we have trashing multi submission performance is best > >>>>> archived to just favor a single process to finish its business and > >>>>> block everybody else. > >>>> Hmm. Sounds like we need a per-manager ww_rwsem protecting manager > >>>> allocation, taken in write-mode then there's thrashing. In read-mode > >>>> otherwise. That would limit the amount of "unnecessary" locks we'd have > >>>> to keep and reduce unwanted side-effects, (see below): > >>> Well per-manager (you mean per domain here don't you?) doesn't sound like > >>> that useful because we rarely use only one domain, but I'm actually > >>> questioning for quite a while if the per BO lock scheme was the right > >>> approach. > >>> > >>> See from the performance aspect the closest to ideal solution I can think of > >>> would be a ww_rwsem per user of a resource. > >>> > >>> In other words we don't lock BOs, but instead a list of all their users and > >>> when you want to evict a BO you need to walk that list and inform all users > >>> that the BO will be moving. > >>> > >>> During command submission you then have the fast path which rather just > >>> grabs the read side of the user lock and check if all BOs are still in the > >>> expected place. > >>> > >>> If some BOs were evicted you back off and start the slow path, e.g. maybe > >>> even copy additional data from userspace then grab the write side of the > >>> lock etc.. etc... > >>> > >>> That approach is similar to what we use in amdgpu with the per-VM BOs, but > >>> goes a step further. Problem is that we are so used to per BO locks in the > >>> kernel that this is probably not doable any more. > >> Yeah I think it'd be nice to have the same approach for shared bo too. I > >> guess what we could do is something like this (spinning your ww_rwmutex > >> idea a bit further): > >> > >> dma_buf_read_lock(buf, vm) > >> { > >> if (enabled(CONFIG_DEBUG_WW_MUTEX_SLOWPATH)) > >> { > >> check that vm is indeed listed in buf and splat if not > >> } > >> > >> /* for a buf that's not shared in multiple vm we'd have buf->resv > >> * == vm->resv here */ > >> return ww_mutex_lock(vm->resv); > >> } > >> > >> dma_buf_write_lock(buf) > >> { > >> for_each_vm_in_buf(buf, vm) { > >> ww_mutex_lock(vm->resv); > >> } > >> } > >> > >> Ideally we'd track all these vms with something slightly less shoddy than > >> a linked list :-) Resizeable array is probably pretty good, I think we > >> only ever need to go from buf -> vm list, not the other way round. At > >> least in dma_resv/dma_buf code, driver code ofc needs to keep a list of > >> all bo bound to a vm somewhere. But that's probably a much bigger > >> datastructure for tracking vma offsets and mappings and other things on > >> top. > >> > >> Ofc to even just get there we'd need something like the sublock list to > >> keep track of all the additional locks we'd need for the writer lock. And > >> we'd need the release callback for backoff, so that we could also go > >> through the slowpath on a vm object that we're not holding a full > >> reference on. That also means vm need to be refcounted. > >> > >> And the list of vms on a buffer need to be protected with some lock and > >> the usual kref_get_unless_zero trickery. > >> > >> But with all this I think we can make the dma_buf_write_lock lock 100% > >> like the old per-buffer lock for everyone. And execbuf could switch over > >> to dma_buf_read_lock for shared buffers. Bonus points when the gpu context > >> just keeps track of a list of shared vm used by buffers in that context > >> ... > >> > >> That way we could make vm fastpath locking a la amdgpu opt-in, while > >> keeping everyone else on the per-object locking juices. > >> > >> Thoughts? > > At least to me that sounds like a plan. > > > One thing I just realized, which is nasty: The full (write) lock needs > > ww_acquire_ctx with this, because it needs to take a bunch of locks. > > Rolling that out everywhere is going to be nasty. > > Why? Take a single write lock shouldn't be different to taking a single > ww_mutex, or am I missing something? There's no single write lock in my proposal. The write lock for a buffer is "take all the vm locks this buffer is mapped into". Which means we need to take multiple ww_mutex locks, which means backoff and everything. Otherwise the read lock of just taking the per vm dma_resv lock won't work. And doing an true rw lock on each buffer feels a bit pointless, since then execbuf is back to O(num_buffers). And that's the suck we're trying to avoid. > > I guess though we could do a fallback and have a locally created > > ww_acquire_ctx if there's none passed in, with backoff entirely > > implemented within dma_resv_lock. > > How should that work? As far as I understand it the ww_acquire_ctx must > be kept existing until after the last of the locks it was used with is > unlocked. Or do I see this incorrectly? Ugh right. I guess we could put the ww_acquire_ctx into the dma_resv so it survives until the unlock. But that's a rather gross hack. -Daniel > > -Daniel > > > >> Cheers, Daniel > >> > >> PS: Of course the write lock for these buffers is going to be terrible, so > >> every time you need to update fences for implicit sync on shared buffer > >> (well write fence at least) it's going to suck. We probably also want a > >> read_to_write_upgrade function, which also can be done easily with > >> ww_mutex magic. > > I'm thinking that we probably sole want a read_to_write upgrade function. > > Regards, > Christian. > > >> -- > >> Daniel Vetter > >> Software Engineer, Intel Corporation > >> > -- Daniel Vetter Software Engineer, Intel Corporation +41 (0) 79 365 57 48 - http://blog.ffwll.ch