* Lokesh Gidra <lokeshgidra@xxxxxxxxxx> [240205 20:10]: > All userfaultfd operations, except write-protect, opportunistically use > per-vma locks to lock vmas. On failure, attempt again inside mmap_lock > critical section. > > Write-protect operation requires mmap_lock as it iterates over multiple > vmas. > > Signed-off-by: Lokesh Gidra <lokeshgidra@xxxxxxxxxx> > --- > fs/userfaultfd.c | 13 +- > include/linux/mm.h | 16 +++ > include/linux/userfaultfd_k.h | 5 +- > mm/memory.c | 48 +++++++ > mm/userfaultfd.c | 242 +++++++++++++++++++++------------- > 5 files changed, 222 insertions(+), 102 deletions(-) > > diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c > index c00a021bcce4..60dcfafdc11a 100644 > --- a/fs/userfaultfd.c > +++ b/fs/userfaultfd.c > @@ -2005,17 +2005,8 @@ static int userfaultfd_move(struct userfaultfd_ctx *ctx, > return -EINVAL; > > if (mmget_not_zero(mm)) { > - mmap_read_lock(mm); > - > - /* Re-check after taking map_changing_lock */ > - down_read(&ctx->map_changing_lock); > - if (likely(!atomic_read(&ctx->mmap_changing))) > - ret = move_pages(ctx, mm, uffdio_move.dst, uffdio_move.src, > - uffdio_move.len, uffdio_move.mode); > - else > - ret = -EAGAIN; > - up_read(&ctx->map_changing_lock); > - mmap_read_unlock(mm); > + ret = move_pages(ctx, uffdio_move.dst, uffdio_move.src, > + uffdio_move.len, uffdio_move.mode); > mmput(mm); > } else { > return -ESRCH; > diff --git a/include/linux/mm.h b/include/linux/mm.h > index 0d1f98ab0c72..e69dfe2edcce 100644 > --- a/include/linux/mm.h > +++ b/include/linux/mm.h > @@ -753,6 +753,11 @@ static inline void release_fault_lock(struct vm_fault *vmf) > mmap_read_unlock(vmf->vma->vm_mm); > } > > +static inline void unlock_vma(struct mm_struct *mm, struct vm_area_struct *vma) > +{ > + vma_end_read(vma); > +} > + > static inline void assert_fault_locked(struct vm_fault *vmf) > { > if (vmf->flags & FAULT_FLAG_VMA_LOCK) > @@ -774,6 +779,9 @@ static inline void vma_assert_write_locked(struct vm_area_struct *vma) > { mmap_assert_write_locked(vma->vm_mm); } > static inline void vma_mark_detached(struct vm_area_struct *vma, > bool detached) {} > +static inline void vma_acquire_read_lock(struct vm_area_struct *vma) { > + mmap_assert_locked(vma->vm_mm); > +} > > static inline struct vm_area_struct *lock_vma_under_rcu(struct mm_struct *mm, > unsigned long address) > @@ -786,6 +794,11 @@ static inline void release_fault_lock(struct vm_fault *vmf) > mmap_read_unlock(vmf->vma->vm_mm); > } > > +static inline void unlock_vma(struct mm_struct *mm, struct vm_area_struct *vma) > +{ > + mmap_read_unlock(mm); > +} > + Instead of passing two variables and only using one based on configuration of kernel build, why not use vma->vm_mm to mmap_read_unlock() and just pass the vma? It is odd to call unlock_vma() which maps to mmap_read_unlock(). Could we have this abstraction depend on CONFIG_PER_VMA_LOCK in uffd so that reading the code remains clear? You seem to have pretty much two versions of each function already. If you do that, then we can leave unlock_vma() undefined if !CONFIG_PER_VMA_LOCK. > static inline void assert_fault_locked(struct vm_fault *vmf) > { > mmap_assert_locked(vmf->vma->vm_mm); > @@ -794,6 +807,9 @@ static inline void assert_fault_locked(struct vm_fault *vmf) > #endif /* CONFIG_PER_VMA_LOCK */ > > extern const struct vm_operations_struct vma_dummy_vm_ops; > +extern struct vm_area_struct *lock_vma(struct mm_struct *mm, > + unsigned long address, > + bool prepare_anon); > > /* > * WARNING: vma_init does not initialize vma->vm_lock. > diff --git a/include/linux/userfaultfd_k.h b/include/linux/userfaultfd_k.h > index 3210c3552976..05d59f74fc88 100644 > --- a/include/linux/userfaultfd_k.h > +++ b/include/linux/userfaultfd_k.h > @@ -138,9 +138,8 @@ extern long uffd_wp_range(struct vm_area_struct *vma, > /* move_pages */ > void double_pt_lock(spinlock_t *ptl1, spinlock_t *ptl2); > void double_pt_unlock(spinlock_t *ptl1, spinlock_t *ptl2); > -ssize_t move_pages(struct userfaultfd_ctx *ctx, struct mm_struct *mm, > - unsigned long dst_start, unsigned long src_start, > - unsigned long len, __u64 flags); > +ssize_t move_pages(struct userfaultfd_ctx *ctx, unsigned long dst_start, > + unsigned long src_start, unsigned long len, __u64 flags); > int move_pages_huge_pmd(struct mm_struct *mm, pmd_t *dst_pmd, pmd_t *src_pmd, pmd_t dst_pmdval, > struct vm_area_struct *dst_vma, > struct vm_area_struct *src_vma, > diff --git a/mm/memory.c b/mm/memory.c > index b05fd28dbce1..393ab3b0d6f3 100644 > --- a/mm/memory.c > +++ b/mm/memory.c > @@ -5760,8 +5760,56 @@ struct vm_area_struct *lock_vma_under_rcu(struct mm_struct *mm, > count_vm_vma_lock_event(VMA_LOCK_ABORT); > return NULL; > } > + > +static void vma_acquire_read_lock(struct vm_area_struct *vma) > +{ > + /* > + * We cannot use vma_start_read() as it may fail due to false locked > + * (see comment in vma_start_read()). We can avoid that by directly > + * locking vm_lock under mmap_lock, which guarantees that nobody could > + * have locked the vma for write (vma_start_write()). > + */ > + mmap_assert_locked(vma->vm_mm); > + down_read(&vma->vm_lock->lock); > +} > #endif /* CONFIG_PER_VMA_LOCK */ > > +/* > + * lock_vma() - Lookup and lock VMA corresponding to @address. Missing arguments in the comment > + * @prepare_anon: If true, then prepare the VMA (if anonymous) with anon_vma. > + * > + * Should be called without holding mmap_lock. VMA should be unlocked after use > + * with unlock_vma(). > + * > + * Return: A locked VMA containing @address, NULL of no VMA is found, or > + * -ENOMEM if anon_vma couldn't be allocated. > + */ > +struct vm_area_struct *lock_vma(struct mm_struct *mm, > + unsigned long address, > + bool prepare_anon) > +{ > + struct vm_area_struct *vma; > + > + vma = lock_vma_under_rcu(mm, address); > + Nit: extra new line > + if (vma) > + return vma; > + > + mmap_read_lock(mm); > + vma = vma_lookup(mm, address); > + if (vma) { > + if (prepare_anon && vma_is_anonymous(vma) && > + anon_vma_prepare(vma)) > + vma = ERR_PTR(-ENOMEM); > + else > + vma_acquire_read_lock(vma); > + } > + > + if (IS_ENABLED(CONFIG_PER_VMA_LOCK) || !vma || PTR_ERR(vma) == -ENOMEM) > + mmap_read_unlock(mm); > + return vma; > +} > + It is also very odd that lock_vma() may, in fact, be locking the mm. It seems like there is a layer of abstraction missing here, where your code would either lock the vma or lock the mm - like you had before, but without the confusing semantics of unlocking with a flag. That is, we know what to do to unlock based on CONFIG_PER_VMA_LOCK, but it isn't always used. Maybe my comments were not clear on what I was thinking on the locking plan. I was thinking that, in the CONFIG_PER_VMA_LOCK case, you could have a lock_vma() which does the per-vma locking which you can use in your code. You could call lock_vma() in some uffd helper function that would do what is required (limit checking, etc) and return a locked vma. The counterpart of that would be another helper function that would do what was required under the mmap_read lock (limit check, etc). The unlocking would be entirely config dependant as you have today. Just write the few functions you have twice: once for per-vma lock support, once without it. Since we now can ensure the per-vma lock is taken in the per-vma lock path (or it failed), then you don't need to mmap_locked boolean you had in the previous version. You solved the unlock issue already, but it should be abstracted so uffd calls the underlying unlock vs vma_unlock() doing an mmap_read_unlock() - because that's very confusing to see. I'd drop the vma from the function names that lock the mm or the vma as well. Thanks, Liam