* Lokesh Gidra <lokeshgidra@xxxxxxxxxx> [240207 13:48]:
> On Tue, Feb 6, 2024 at 9:05 AM Liam R. Howlett <Liam.Howlett@xxxxxxxxxx> wrote:
> >
> > * Lokesh Gidra <lokeshgidra@xxxxxxxxxx> [240205 20:10]:
> > > All userfaultfd operations, except write-protect, opportunistically use
> > > per-vma locks to lock vmas. On failure, attempt again inside the
> > > mmap_lock critical section.
> > >
> > > The write-protect operation requires mmap_lock as it iterates over
> > > multiple vmas.
> > >
> > > Signed-off-by: Lokesh Gidra <lokeshgidra@xxxxxxxxxx>
> > > ---
> > >  fs/userfaultfd.c              |  13 +-
> > >  include/linux/mm.h            |  16 +++
> > >  include/linux/userfaultfd_k.h |   5 +-
> > >  mm/memory.c                   |  48 +++++++
> > >  mm/userfaultfd.c              | 242 +++++++++++++++++++++-------------
> > >  5 files changed, 222 insertions(+), 102 deletions(-)
> > >
> > > diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
> > > index c00a021bcce4..60dcfafdc11a 100644
> > > --- a/fs/userfaultfd.c
> > > +++ b/fs/userfaultfd.c
> > > @@ -2005,17 +2005,8 @@ static int userfaultfd_move(struct userfaultfd_ctx *ctx,
> > >  		return -EINVAL;
> > >
> > >  	if (mmget_not_zero(mm)) {
> > > -		mmap_read_lock(mm);
> > > -
> > > -		/* Re-check after taking map_changing_lock */
> > > -		down_read(&ctx->map_changing_lock);
> > > -		if (likely(!atomic_read(&ctx->mmap_changing)))
> > > -			ret = move_pages(ctx, mm, uffdio_move.dst, uffdio_move.src,
> > > -					 uffdio_move.len, uffdio_move.mode);
> > > -		else
> > > -			ret = -EAGAIN;
> > > -		up_read(&ctx->map_changing_lock);
> > > -		mmap_read_unlock(mm);
> > > +		ret = move_pages(ctx, uffdio_move.dst, uffdio_move.src,
> > > +				 uffdio_move.len, uffdio_move.mode);
> > >  		mmput(mm);
> > >  	} else {
> > >  		return -ESRCH;
> > > diff --git a/include/linux/mm.h b/include/linux/mm.h
> > > index 0d1f98ab0c72..e69dfe2edcce 100644
> > > --- a/include/linux/mm.h
> > > +++ b/include/linux/mm.h
> > > @@ -753,6 +753,11 @@ static inline void release_fault_lock(struct vm_fault *vmf)
> > >  		mmap_read_unlock(vmf->vma->vm_mm);
> > >  }
> > >
> > > +static inline void unlock_vma(struct mm_struct *mm, struct vm_area_struct *vma)
> > > +{
> > > +	vma_end_read(vma);
> > > +}
> > > +
> > >  static inline void assert_fault_locked(struct vm_fault *vmf)
> > >  {
> > >  	if (vmf->flags & FAULT_FLAG_VMA_LOCK)
> > > @@ -774,6 +779,9 @@ static inline void vma_assert_write_locked(struct vm_area_struct *vma)
> > >  	{ mmap_assert_write_locked(vma->vm_mm); }
> > >  static inline void vma_mark_detached(struct vm_area_struct *vma,
> > >  				     bool detached) {}
> > > +static inline void vma_acquire_read_lock(struct vm_area_struct *vma) {
> > > +	mmap_assert_locked(vma->vm_mm);
> > > +}
> > >
> > >  static inline struct vm_area_struct *lock_vma_under_rcu(struct mm_struct *mm,
> > >  		unsigned long address)
> > > @@ -786,6 +794,11 @@ static inline void release_fault_lock(struct vm_fault *vmf)
> > >  		mmap_read_unlock(vmf->vma->vm_mm);
> > >  }
> > >
> > > +static inline void unlock_vma(struct mm_struct *mm, struct vm_area_struct *vma)
> > > +{
> > > +	mmap_read_unlock(mm);
> > > +}
> > > +
> >
> > Instead of passing two variables and only using one based on the kernel
> > build configuration, why not pass just the vma and use vma->vm_mm for
> > mmap_read_unlock()?
> >
> > It is odd to call unlock_vma() when it maps to mmap_read_unlock().
> > Could we have this abstraction depend on CONFIG_PER_VMA_LOCK in uffd so
> > that reading the code remains clear?  You seem to have pretty much two
> > versions of each function already.  If you do that, then we can leave
> > unlock_vma() undefined if !CONFIG_PER_VMA_LOCK.
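> >
> > Something along these lines, for illustration (completely untested):
> >
> > #ifdef CONFIG_PER_VMA_LOCK
> > static inline void unlock_vma(struct vm_area_struct *vma)
> > {
> > 	vma_end_read(vma);
> > }
> > #else
> > /* No unlock_vma(); callers just do mmap_read_unlock(vma->vm_mm). */
> > #endif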
> > >  static inline void assert_fault_locked(struct vm_fault *vmf)
> > >  {
> > >  	mmap_assert_locked(vmf->vma->vm_mm);
> > > @@ -794,6 +807,9 @@ static inline void assert_fault_locked(struct vm_fault *vmf)
> > >  #endif /* CONFIG_PER_VMA_LOCK */
> > >
> > >  extern const struct vm_operations_struct vma_dummy_vm_ops;
> > > +extern struct vm_area_struct *lock_vma(struct mm_struct *mm,
> > > +				       unsigned long address,
> > > +				       bool prepare_anon);
> > >
> > >  /*
> > >   * WARNING: vma_init does not initialize vma->vm_lock.
> > > diff --git a/include/linux/userfaultfd_k.h b/include/linux/userfaultfd_k.h
> > > index 3210c3552976..05d59f74fc88 100644
> > > --- a/include/linux/userfaultfd_k.h
> > > +++ b/include/linux/userfaultfd_k.h
> > > @@ -138,9 +138,8 @@ extern long uffd_wp_range(struct vm_area_struct *vma,
> > >  /* move_pages */
> > >  void double_pt_lock(spinlock_t *ptl1, spinlock_t *ptl2);
> > >  void double_pt_unlock(spinlock_t *ptl1, spinlock_t *ptl2);
> > > -ssize_t move_pages(struct userfaultfd_ctx *ctx, struct mm_struct *mm,
> > > -		   unsigned long dst_start, unsigned long src_start,
> > > -		   unsigned long len, __u64 flags);
> > > +ssize_t move_pages(struct userfaultfd_ctx *ctx, unsigned long dst_start,
> > > +		   unsigned long src_start, unsigned long len, __u64 flags);
> > >  int move_pages_huge_pmd(struct mm_struct *mm, pmd_t *dst_pmd, pmd_t *src_pmd, pmd_t dst_pmdval,
> > >  			struct vm_area_struct *dst_vma,
> > >  			struct vm_area_struct *src_vma,
> > > diff --git a/mm/memory.c b/mm/memory.c
> > > index b05fd28dbce1..393ab3b0d6f3 100644
> > > --- a/mm/memory.c
> > > +++ b/mm/memory.c
> > > @@ -5760,8 +5760,56 @@ struct vm_area_struct *lock_vma_under_rcu(struct mm_struct *mm,
> > >  	count_vm_vma_lock_event(VMA_LOCK_ABORT);
> > >  	return NULL;
> > >  }
> > > +
> > > +static void vma_acquire_read_lock(struct vm_area_struct *vma)
> > > +{
> > > +	/*
> > > +	 * We cannot use vma_start_read() as it may fail due to false locked
> > > +	 * (see comment in vma_start_read()). We can avoid that by directly
> > > +	 * locking vm_lock under mmap_lock, which guarantees that nobody could
> > > +	 * have locked the vma for write (vma_start_write()).
> > > +	 */
> > > +	mmap_assert_locked(vma->vm_mm);
> > > +	down_read(&vma->vm_lock->lock);
> > > +}
> > >  #endif /* CONFIG_PER_VMA_LOCK */
> > >
> > > +/*
> > > + * lock_vma() - Lookup and lock VMA corresponding to @address.
> >
> > Missing arguments in the comment.
> >
> > > + * @prepare_anon: If true, then prepare the VMA (if anonymous) with anon_vma.
> > > + *
> > > + * Should be called without holding mmap_lock. VMA should be unlocked after use
> > > + * with unlock_vma().
> > > + *
> > > + * Return: A locked VMA containing @address, NULL if no VMA is found, or
> > > + * -ENOMEM if anon_vma couldn't be allocated.
> > > + */
> > > +struct vm_area_struct *lock_vma(struct mm_struct *mm,
> > > +				unsigned long address,
> > > +				bool prepare_anon)
> > > +{
> > > +	struct vm_area_struct *vma;
> > > +
> > > +	vma = lock_vma_under_rcu(mm, address);
> > > +
> >
> > Nit: extra blank line.
> >
> > > +	if (vma)
> > > +		return vma;
> > > +
> > > +	mmap_read_lock(mm);
> > > +	vma = vma_lookup(mm, address);
> > > +	if (vma) {
> > > +		if (prepare_anon && vma_is_anonymous(vma) &&
> > > +				anon_vma_prepare(vma))
> > > +			vma = ERR_PTR(-ENOMEM);
> > > +		else
> > > +			vma_acquire_read_lock(vma);
> > > +	}
> > > +
> > > +	if (IS_ENABLED(CONFIG_PER_VMA_LOCK) || !vma || PTR_ERR(vma) == -ENOMEM)
> > > +		mmap_read_unlock(mm);
> > > +	return vma;
> > > +}
> > > +
> >
> > It is also very odd that lock_vma() may, in fact, be locking the mm.  It
> > seems like there is a layer of abstraction missing here, where your code
> > would either lock the vma or lock the mm - like you had before, but
> > without the confusing semantics of unlocking with a flag.  That is, we
> > know what to do to unlock based on CONFIG_PER_VMA_LOCK, but that
> > knowledge isn't always used.
> >
> > Maybe my comments were not clear on what I was thinking about the
> > locking plan.  I was thinking that, in the CONFIG_PER_VMA_LOCK case, you
> > could have a lock_vma() that does the per-vma locking, which you can use
> > in your code.  You could call lock_vma() from some uffd helper function
> > that would do what is required (limit checking, etc.) and return a
> > locked vma.
> >
> > The counterpart of that would be another helper function that would do
> > what is required under the mmap_read lock (limit check, etc.).  The
> > unlocking would be entirely config dependent, as you have today.
> >
> > Just write the few functions you have twice: once for per-vma lock
> > support, once without it.  Since we can now ensure the per-vma lock is
> > taken in the per-vma lock path (or that it failed), you don't need the
> > mmap_locked boolean you had in the previous version.  You solved the
> > unlock issue already, but it should be abstracted so uffd calls the
> > underlying unlock, instead of unlock_vma() doing an mmap_read_unlock() -
> > because that is very confusing to see.
> >
> > I'd also drop the vma from the names of the functions that lock the mm
> > or the vma.
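> >
> > To illustrate the shape I mean (completely untested; the name is made
> > up, and the real helpers would add the uffd limit checks on top):
> >
> > #ifdef CONFIG_PER_VMA_LOCK
> > static struct vm_area_struct *uffd_lock(struct mm_struct *mm,
> > 					unsigned long address)
> > {
> > 	struct vm_area_struct *vma;
> >
> > 	vma = lock_vma_under_rcu(mm, address);
> > 	if (vma)
> > 		return vma;
> >
> > 	/*
> > 	 * Fall back to mmap_lock, but take the per-vma lock under it
> > 	 * (as your vma_acquire_read_lock() does) and drop mmap_lock
> > 	 * again, so the caller always unlocks with vma_end_read().
> > 	 */
> > 	mmap_read_lock(mm);
> > 	vma = vma_lookup(mm, address);
> > 	if (vma)
> > 		vma_acquire_read_lock(vma);
> > 	mmap_read_unlock(mm);
> > 	return vma;
> > }
> > #else
> > static struct vm_area_struct *uffd_lock(struct mm_struct *mm,
> > 					unsigned long address)
> > {
> > 	struct vm_area_struct *vma;
> >
> > 	mmap_read_lock(mm);
> > 	vma = vma_lookup(mm, address);
> > 	if (!vma)
> > 		mmap_read_unlock(mm);
> > 	return vma;
> > }
> > #endif
> >
> > The matching unlock is then unambiguous per config: vma_end_read()
> > with CONFIG_PER_VMA_LOCK, mmap_read_unlock(vma->vm_mm) without it.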
> >
> > Thanks,
> > Liam
>
> I got it now. I'll make the changes in the next version.
>
> Would it be ok to define lock_vma()/unlock_vma() (in the
> CONFIG_PER_VMA_LOCK case) in mm/userfaultfd.c instead? The reason I say
> this is that, first, there are no other users of these functions, and
> second, there is what Jann pointed out about anon_vma:
> lock_vma_under_rcu() (rightly) only checks the private+anonymous case
> and not the private+file-backed case. So the lock_vma() implementation
> is getting very userfaultfd-specific IMO.

Yes, this sounds reasonable.

Looking forward to the next revision.
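For the anon_vma part, I'd picture the fast path of such a uffd-local
helper looking roughly like this (untested sketch; the name is made up):

static struct vm_area_struct *uffd_lock_vma_fastpath(struct mm_struct *mm,
						     unsigned long address)
{
	struct vm_area_struct *vma;

	vma = lock_vma_under_rcu(mm, address);
	if (!vma)
		return NULL;

	/*
	 * lock_vma_under_rcu() only rejects the private+anonymous case
	 * with no anon_vma; uffd also needs anon_vma for private
	 * file-backed dst vmas, and anon_vma_prepare() requires
	 * mmap_lock, so punt those to the mmap_lock slow path too.
	 */
	if (!(vma->vm_flags & VM_SHARED) && !vma->anon_vma) {
		vma_end_read(vma);
		return NULL;
	}
	return vma;
}

That keeps the anon_vma_prepare() call entirely in the mmap_lock path.

Thanks,
Liam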