On Fri, Jan 10, 2025 at 7:56 AM Suren Baghdasaryan <surenb@xxxxxxxxxx> wrote:
>
> On Fri, Jan 10, 2025 at 6:33 AM Vlastimil Babka <vbabka@xxxxxxx> wrote:
> >
> > On 1/9/25 3:30 AM, Suren Baghdasaryan wrote:
> > > rw_semaphore is a sizable structure of 40 bytes and consumes
> > > considerable space for each vm_area_struct. However vma_lock has
> > > two important specifics which can be used to replace rw_semaphore
> > > with a simpler structure:
> > > 1. Readers never wait. They try to take the vma_lock and fall back to
> > > mmap_lock if that fails.
> > > 2. Only one writer at a time will ever try to write-lock a vma_lock
> > > because writers first take mmap_lock in write mode.
> > > Because of these requirements, full rw_semaphore functionality is not
> > > needed and we can replace rw_semaphore and the vma->detached flag with
> > > a refcount (vm_refcnt).
> > > When vma is in detached state, vm_refcnt is 0 and only a call to
> > > vma_mark_attached() can take it out of this state. Note that unlike
> > > before, now we enforce both vma_mark_attached() and vma_mark_detached()
> > > to be done only after vma has been write-locked. vma_mark_attached()
> > > changes vm_refcnt to 1 to indicate that it has been attached to the vma
> > > tree. When a reader takes read lock, it increments vm_refcnt, unless the
> > > top usable bit of vm_refcnt (0x40000000) is set, indicating presence of
> > > a writer. When writer takes write lock, it sets the top usable bit to
> > > indicate its presence. If there are readers, writer will wait using newly
> > > introduced mm->vma_writer_wait. Since all writers take mmap_lock in write
> > > mode first, there can be only one writer at a time. The last reader to
> > > release the lock will signal the writer to wake up.
> > > refcount might overflow if there are many competing readers, in which case
> > > read-locking will fail. Readers are expected to handle such failures.
> > > In summary:
> > > 1. all readers increment the vm_refcnt;
> > > 2. writer sets top usable (writer) bit of vm_refcnt;
> > > 3. readers cannot increment the vm_refcnt if the writer bit is set;
> > > 4. in the presence of readers, writer must wait for the vm_refcnt to drop
> > > to 1 (ignoring the writer bit), indicating an attached vma with no readers;
> > > 5. vm_refcnt overflow is handled by the readers.
> > >
> > > Suggested-by: Peter Zijlstra <peterz@xxxxxxxxxxxxx>
> > > Suggested-by: Matthew Wilcox <willy@xxxxxxxxxxxxx>
> > > Signed-off-by: Suren Baghdasaryan <surenb@xxxxxxxxxx>
> >
> > Reviewed-by: Vlastimil Babka <vbabka@xxxxxxx>
> >
> > But I think there's a problem that will manifest after patch 15.
> > Also I don't feel qualified enough about the lockdep parts though
> > (although I think I spotted another issue with those, below) so best if
> > PeterZ can review those.
> > Some nits below too.
> >
> > > +
> > > +static inline void vma_refcount_put(struct vm_area_struct *vma)
> > > +{
> > > +        int oldcnt;
> > > +
> > > +        if (!__refcount_dec_and_test(&vma->vm_refcnt, &oldcnt)) {
> > > +                rwsem_release(&vma->vmlock_dep_map, _RET_IP_);
> >
> > Shouldn't we rwsem_release always? And also shouldn't it precede the
> > refcount operation itself?
>
> Yes. Hillf pointed to the same issue. It will be fixed in the next version.
>
> >
> > > +                if (is_vma_writer_only(oldcnt - 1))
> > > +                        rcuwait_wake_up(&vma->vm_mm->vma_writer_wait);
> >
> > Hmm hmm we should maybe read the vm_mm pointer before dropping the
> > refcount? In case this races in a way that is_vma_writer_only tests true
> > but the writer meanwhile finishes and frees the vma. It's safe now but
> > not after making the cache SLAB_TYPESAFE_BY_RCU ?
>
> Hmm. But if is_vma_writer_only() is true that means the writer is
> blocked and is waiting for the reader to drop the vm_refcnt. IOW, it
> won't proceed and free the vma until the reader calls
> rcuwait_wake_up(). Your suggested change is trivial and I can do it
> but I want to make sure I'm not missing something. Am I?

Ok, after thinking some more, I think the race you might be referring
to is this:

writer                                      reader

__vma_enter_locked
    refcount_add_not_zero(VMA_LOCK_OFFSET, ...)
                                            vma_refcount_put
                                                __refcount_dec_and_test()
                                                if (is_vma_writer_only())
    rcuwait_wait_event(&vma->vm_mm->vma_writer_wait, ...)
__vma_exit_locked
    refcount_sub_and_test(VMA_LOCK_OFFSET, ...)
free the vma
                                                    rcuwait_wake_up(&vma->vm_mm->vma_writer_wait);

I think it's possible: the reader's decrement lets the writer's wait
condition become true, so the writer can finish and free the vma before
the reader dereferences vma->vm_mm for the wakeup. Your suggestion of
storing the mm before doing __refcount_dec_and_test() should work.
Thanks for pointing this out! I'll fix it in the next version.

> >
> > > +        }
> > > +}
> > > +
> >
> > >  static inline void vma_end_read(struct vm_area_struct *vma)
> > >  {
> > >          rcu_read_lock(); /* keeps vma alive till the end of up_read */
> >
> > This should refer to vma_refcount_put(). But after fixing it I think we
> > could stop doing this altogether? It will no longer keep vma "alive"
> > with SLAB_TYPESAFE_BY_RCU.
>
> Yeah, I think the comment along with rcu_read_lock()/rcu_read_unlock()
> here can be safely removed.
>
> >
> > > -        up_read(&vma->vm_lock.lock);
> > > +        vma_refcount_put(vma);
> > >          rcu_read_unlock();
> > >  }
> > >
> >
> > <snip>
> >
> > > --- a/mm/memory.c
> > > +++ b/mm/memory.c
> > > @@ -6370,9 +6370,41 @@ struct vm_area_struct *lock_mm_and_find_vma(struct mm_struct *mm,
> > >  #endif
> > >
> > >  #ifdef CONFIG_PER_VMA_LOCK
> > > +static inline bool __vma_enter_locked(struct vm_area_struct *vma, unsigned int tgt_refcnt)
> > > +{
> > > +        /*
> > > +         * If vma is detached then only vma_mark_attached() can raise the
> > > +         * vm_refcnt. mmap_write_lock prevents racing with vma_mark_attached().
> > > +         */
> > > +        if (!refcount_add_not_zero(VMA_LOCK_OFFSET, &vma->vm_refcnt))
> > > +                return false;
> > > +
> > > +        rwsem_acquire(&vma->vmlock_dep_map, 0, 0, _RET_IP_);
> > > +        rcuwait_wait_event(&vma->vm_mm->vma_writer_wait,
> > > +                   refcount_read(&vma->vm_refcnt) == tgt_refcnt,
> > > +                   TASK_UNINTERRUPTIBLE);
> > > +        lock_acquired(&vma->vmlock_dep_map, _RET_IP_);
> > > +
> > > +        return true;
> > > +}
> > > +
> > > +static inline void __vma_exit_locked(struct vm_area_struct *vma, bool *detached)
> > > +{
> > > +        *detached = refcount_sub_and_test(VMA_LOCK_OFFSET, &vma->vm_refcnt);
> > > +        rwsem_release(&vma->vmlock_dep_map, _RET_IP_);
> > > +}
> > > +
> > >  void __vma_start_write(struct vm_area_struct *vma, unsigned int mm_lock_seq)
> > >  {
> > > -        down_write(&vma->vm_lock.lock);
> > > +        bool locked;
> > > +
> > > +        /*
> > > +         * __vma_enter_locked() returns false immediately if the vma is not
> > > +         * attached, otherwise it waits until refcnt is (VMA_LOCK_OFFSET + 1)
> > > +         * indicating that vma is attached with no readers.
> > > +         */
> > > +        locked = __vma_enter_locked(vma, VMA_LOCK_OFFSET + 1);
> >
> > Wonder if it would be slightly better if tgt_refcount was just 1 (or 0
> > below in vma_mark_detached()) and the VMA_LOCK_OFFSET added to it in
> > __vma_enter_locked() itself as it's the one adding it in the first place.
>
> Well, it won't be called tgt_refcount then. Maybe "bool vma_attached"
> and inside __vma_enter_locked() we do:
>
> unsigned int tgt_refcnt = VMA_LOCK_OFFSET + (vma_attached ? 1 : 0);
>
> Is that better?
>
> >
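
One note on the tgt_refcnt expression discussed above: in C, '+' binds
tighter than '?:', so the conditional has to be parenthesized or the
whole sum becomes the condition. A minimal userspace sketch of the
difference follows. The helper names vma_tgt_refcnt() and
vma_tgt_refcnt_unparenthesized() are hypothetical and not part of the
patch; VMA_LOCK_OFFSET uses the 0x40000000 value from the changelog.

```c
#include <assert.h>
#include <stdbool.h>

/* Writer bit from the changelog: the top usable bit of vm_refcnt. */
#define VMA_LOCK_OFFSET 0x40000000u

static_assert((VMA_LOCK_OFFSET & (VMA_LOCK_OFFSET - 1)) == 0,
              "writer bit must be a single bit");

/*
 * Hypothetical helper (assumption, not in the patch): the refcount value
 * the writer waits for. The low part is 1 for an attached vma with no
 * readers and 0 for a detached vma; the writer bit is always set.
 */
static unsigned int vma_tgt_refcnt(bool vma_attached)
{
        return VMA_LOCK_OFFSET + (vma_attached ? 1u : 0u);
}

/*
 * How "VMA_LOCK_OFFSET + vma_attached ? 1 : 0" parses without the inner
 * parentheses: the sum is the condition, and since it is never zero the
 * result is always 1.
 */
static unsigned int vma_tgt_refcnt_unparenthesized(bool vma_attached)
{
        return (VMA_LOCK_OFFSET + vma_attached) ? 1u : 0u;
}
```

With the parentheses the writer waits for VMA_LOCK_OFFSET + 1 (attached)
or VMA_LOCK_OFFSET (detached), matching the values passed explicitly in
the current patch. Without them the target is 1 in both cases, which can
never be observed while the writer bit is set.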