On Tue, Jan 17, 2023 at 01:21:47PM -0800, Suren Baghdasaryan wrote: > On Tue, Jan 17, 2023 at 7:12 AM Michal Hocko <mhocko@xxxxxxxx> wrote: > > > > On Tue 17-01-23 16:04:26, Michal Hocko wrote: > > > On Mon 09-01-23 12:53:07, Suren Baghdasaryan wrote: > > > > Introduce a per-VMA rw_semaphore to be used during page fault handling > > > > instead of mmap_lock. Because there are cases when multiple VMAs need > > > > to be exclusively locked during VMA tree modifications, instead of the > > > > usual lock/unlock patter we mark a VMA as locked by taking per-VMA lock > > > > exclusively and setting vma->lock_seq to the current mm->lock_seq. When > > > > mmap_write_lock holder is done with all modifications and drops mmap_lock, > > > > it will increment mm->lock_seq, effectively unlocking all VMAs marked as > > > > locked. > > > > > > I have to say I was struggling a bit with the above and only understood > > > what you mean by reading the patch several times. I would phrase it like > > > this (feel free to use if you consider this to be an improvement). > > > > > > Introduce a per-VMA rw_semaphore. The lock implementation relies on a > > > per-vma and per-mm sequence counters to note exclusive locking: > > > - read lock - (implemented by vma_read_trylock) requires the the > > > vma (vm_lock_seq) and mm (mm_lock_seq) sequence counters to > > > differ. If they match then there must be a vma exclusive lock > > > held somewhere. > > > - read unlock - (implemented by vma_read_unlock) is a trivial > > > vma->lock unlock. > > > - write lock - (vma_write_lock) requires the mmap_lock to be > > > held exclusively and the current mm counter is noted to the vma > > > side. This will allow multiple vmas to be locked under a single > > > mmap_lock write lock (e.g. during vma merging). The vma counter > > > is modified under exclusive vma lock. > > > > Didn't realize one more thing. > > Unlike standard write lock this implementation allows to be > > called multiple times under a single mmap_lock. In a sense > > it is more of mark_vma_potentially_modified than a lock. > > In the RFC it was called vma_mark_locked() originally and renames were > discussed in the email thread ending here: > https://lore.kernel.org/all/621612d7-c537-3971-9520-a3dec7b43cb4@xxxxxxx/. > If other names are preferable I'm open to changing them. I don't want to bikeshed this, but rather than locking it seems to be more: vma_start_read() vma_end_read() vma_start_write() vma_end_write() vma_downgrade_write() ... and that these are _implemented_ with locks (in part) is an implementation detail? Would that reduce people's confusion? > > > > > - write unlock - (vma_write_unlock_mm) is a batch release of all > > > vma locks held. It doesn't pair with a specific > > > vma_write_lock! It is done before exclusive mmap_lock is > > > released by incrementing mm sequence counter (mm_lock_seq). > > > - write downgrade - if the mmap_lock is downgraded to the read > > > lock all vma write locks are released as well (effectivelly > > > same as write unlock). > > -- > > Michal Hocko > > SUSE Labs