Re: RFC: get_user_pages_locked|unlocked to leverage VM_FAULT_RETRY

Andres Lagar-Cavilla <andreslc@xxxxxxxxxx> · Fri, 26 Sep 2014 12:54:46 -0700

On Fri, Sep 26, 2014 at 10:25 AM, Andrea Arcangeli <aarcange@xxxxxxxxxx> wrote:
> On Thu, Sep 25, 2014 at 02:50:29PM -0700, Andres Lagar-Cavilla wrote:
>> It's nearly impossible to name it right because 1) it indicates we can
>> relinquish 2) it returns whether we still hold the mmap semaphore.
>>
>> I'd prefer it'd be called mmap_sem_hold, which conveys immediately
>> what this is about ("nonblocking" or "locked" could be about a whole
>> lot of things)
>
> To me FOLL_NOWAIT/FAULT_FLAG_RETRY_NOWAIT is nonblocking,
> "locked"/FAULT_FLAG_ALLOW_RETRY is still very much blocking, just
> without the mmap_sem, so I called it "locked"... but I'm fine to
> change the name to mmap_sem_hold. Just get_user_pages_mmap_sem_hold
> seems less friendly than get_user_pages_locked(..., &locked). locked
> as you used comes intuitive when you do later "if (locked) up_read".
>

Heh. I was previously referring to the int *locked param , not the
_(un)locked suffix. That param is all about the mmap semaphore, so why
not name it less ambiguously. It's essentially a tristate.

My suggestion is that you just make gup behave as your proposed
gup_locked, and no need to introduce another call. But I understand if
you want to phase this out politely.

> Then I added an _unlocked kind which is a drop in replacement for many
> places just to clean it up.
>
> get_user_pages_unlocked and get_user_pages_fast are equivalent in
> semantics, so any call of get_user_pages_unlocked(current,
> current->mm, ...) has no reason to exist and should be replaced to
> get_user_pages_fast unless "force = 1" (gup_fast has no force param
> just to make the argument list a bit more confusing across the various
> versions of gup).
>
> get_user_pages over time should be phased out and dropped.

Please. Too many variants. So the end goal is
* __gup_fast
* gup_fast == __gup_fast + gup_unlocked for fallback
* gup (or gup_locked)
* gup_unlocked
(and flat __gup remains buried in the impl)?

>
>> I can see that. My background for coming into this is very similar: in
>> a previous life we had a file system shim that would kick up into
>> userspace for servicing VM memory. KVM just wouldn't let the file
>> system give up the mmap semaphore. We had /proc readers hanging up all
>> over the place while userspace was servicing. Not happy.
>>
>> With KVM (now) and the standard x86 fault giving you ALLOW_RETRY, what
>> stands in your way? Methinks that gup_fast has no slowpath fallback
>> that turns on ALLOW_RETRY. What would oppose that being the global
>> behavior?
>
> It should become the global behavior. Just it doesn't need to become a
> global behavior immediately for all kind of gups (i.e. video4linux
> drivers will never need to poke into the KVM guest user memory so it
> doesn't matter if they don't use gup_locked immediately). Even then we
> can still support get_user_pages_locked(...., locked=NULL) for
> ptrace/coredump and other things that may not want to trigger the
> userfaultfd protocol and just get an immediate VM_FAULT_SIGBUS.
>
> Userfaults will just VM_FAULT_SIGBUS (translated to -EFAULT by all gup
> invocations) and not invoke the userfaultfd protocol, if
> FAULT_FLAG_ALLOW_RETRY is not set. So any gup_locked with locked ==
> NULL or or gup() (without locked parameter) will not invoke the
> userfaultfd protocol.
>
> But I need gup_fast to use FAULT_FLAG_ALLOW_RETRY because core places
> like O_DIRECT uses it.
>
> I tried to do a RFC patch below that goes into this direction and
> should be enough for a start to solve all my issues with the mmap_sem
> holding inside handle_userfault(), comments welcome.
>
> =======
> From 41918f7d922d1e7fc70f117db713377e7e2af6e9 Mon Sep 17 00:00:00 2001
> From: Andrea Arcangeli <aarcange@xxxxxxxxxx>
> Date: Fri, 26 Sep 2014 18:36:53 +0200
> Subject: [PATCH 1/2] mm: gup: add get_user_pages_locked and
>  get_user_pages_unlocked
>
> We can leverage the VM_FAULT_RETRY functionality in the page fault
> paths better by using either get_user_pages_locked or
> get_user_pages_unlocked.
>
> The former allow conversion of get_user_pages invocations that will
> have to pass a "&locked" parameter to know if the mmap_sem was dropped
> during the call. Example from:
>
>     down_read(&mm->mmap_sem);
>     do_something()
>     get_user_pages(tsk, mm, ..., pages, NULL);
>     up_read(&mm->mmap_sem);
>
> to:
>
>     int locked = 1;
>     down_read(&mm->mmap_sem);
>     do_something()
>     get_user_pages_locked(tsk, mm, ..., pages, &locked);
>     if (locked)
>         up_read(&mm->mmap_sem);
>
> The latter is suitable only as a drop in replacement of the form:
>
>     down_read(&mm->mmap_sem);
>     get_user_pages(tsk, mm, ..., pages, NULL);
>     up_read(&mm->mmap_sem);
>
> into:
>
>     get_user_pages_unlocked(tsk, mm, ..., pages);
>
> Where tsk, mm, the intermediate "..." paramters and "pages" can be any
> value as before. Just the last parameter of get_user_pages (vmas) must
> be NULL for get_user_pages_locked|unlocked to be usable (the latter
> original form wouldn't have been safe anyway if vmas wasn't null, for
> the former we just make it explicit by dropping the parameter).
>
> If vmas is not NULL these two methods cannot be used.
>
> This patch then applies the new forms in various places, in some case
> also replacing it with get_user_pages_fast whenever tsk and mm are
> current and current->mm. get_user_pages_unlocked varies from
> get_user_pages_fast only if mm is not current->mm (like when
> get_user_pages works on some other process mm). Whenever tsk and mm
> matches current and current->mm get_user_pages_fast must always be
> used to increase performance and get the page lockless (only with irq
> disabled).

Basically all this discussion should go into the patch as comments.
Help people shortcut git blame.

>
> Signed-off-by: Andrea Arcangeli <aarcange@xxxxxxxxxx>
> ---
>  arch/mips/mm/gup.c                 |   8 +-
>  arch/powerpc/mm/gup.c              |   6 +-
>  arch/s390/kvm/kvm-s390.c           |   4 +-
>  arch/s390/mm/gup.c                 |   6 +-
>  arch/sh/mm/gup.c                   |   6 +-
>  arch/sparc/mm/gup.c                |   6 +-
>  arch/x86/mm/gup.c                  |   7 +-
>  drivers/dma/iovlock.c              |  10 +--
>  drivers/iommu/amd_iommu_v2.c       |   6 +-
>  drivers/media/pci/ivtv/ivtv-udma.c |   6 +-
>  drivers/misc/sgi-gru/grufault.c    |   3 +-
>  drivers/scsi/st.c                  |  10 +--
>  drivers/video/fbdev/pvr2fb.c       |   5 +-
>  include/linux/mm.h                 |   7 ++
>  mm/gup.c                           | 147 ++++++++++++++++++++++++++++++++++---
>  mm/mempolicy.c                     |   2 +-
>  mm/nommu.c                         |  23 ++++++
>  mm/process_vm_access.c             |   7 +-
>  mm/util.c                          |  10 +--
>  net/ceph/pagevec.c                 |   9 +--
>  20 files changed, 200 insertions(+), 88 deletions(-)
>
> diff --git a/arch/mips/mm/gup.c b/arch/mips/mm/gup.c
> index 06ce17c..20884f5 100644
> --- a/arch/mips/mm/gup.c
> +++ b/arch/mips/mm/gup.c
> @@ -301,11 +301,9 @@ slow_irqon:
>         start += nr << PAGE_SHIFT;
>         pages += nr;
>
> -       down_read(&mm->mmap_sem);
> -       ret = get_user_pages(current, mm, start,
> -                               (end - start) >> PAGE_SHIFT,
> -                               write, 0, pages, NULL);
> -       up_read(&mm->mmap_sem);
> +       ret = get_user_pages_unlocked(current, mm, start,
> +                                     (end - start) >> PAGE_SHIFT,
> +                                     write, 0, pages);
>
>         /* Have to be a bit careful with return values */
>         if (nr > 0) {
> diff --git a/arch/powerpc/mm/gup.c b/arch/powerpc/mm/gup.c
> index d874668..b70c34a 100644
> --- a/arch/powerpc/mm/gup.c
> +++ b/arch/powerpc/mm/gup.c
> @@ -215,10 +215,8 @@ int get_user_pages_fast(unsigned long start, int nr_pages, int write,
>                 start += nr << PAGE_SHIFT;
>                 pages += nr;
>
> -               down_read(&mm->mmap_sem);
> -               ret = get_user_pages(current, mm, start,
> -                                    nr_pages - nr, write, 0, pages, NULL);
> -               up_read(&mm->mmap_sem);
> +               ret = get_user_pages_unlocked(current, mm, start,
> +                                             nr_pages - nr, write, 0, pages);
>
>                 /* Have to be a bit careful with return values */
>                 if (nr > 0) {
> diff --git a/arch/s390/kvm/kvm-s390.c b/arch/s390/kvm/kvm-s390.c
> index 81b0e11..37ca29a 100644
> --- a/arch/s390/kvm/kvm-s390.c
> +++ b/arch/s390/kvm/kvm-s390.c
> @@ -1092,9 +1092,7 @@ long kvm_arch_fault_in_page(struct kvm_vcpu *vcpu, gpa_t gpa, int writable)
>         hva = gmap_fault(gpa, vcpu->arch.gmap);
>         if (IS_ERR_VALUE(hva))
>                 return (long)hva;
> -       down_read(&mm->mmap_sem);
> -       rc = get_user_pages(current, mm, hva, 1, writable, 0, NULL, NULL);
> -       up_read(&mm->mmap_sem);
> +       rc = get_user_pages_unlocked(current, mm, hva, 1, writable, 0, NULL);
>
>         return rc < 0 ? rc : 0;
>  }
> diff --git a/arch/s390/mm/gup.c b/arch/s390/mm/gup.c
> index 639fce46..5c586c7 100644
> --- a/arch/s390/mm/gup.c
> +++ b/arch/s390/mm/gup.c
> @@ -235,10 +235,8 @@ int get_user_pages_fast(unsigned long start, int nr_pages, int write,
>         /* Try to get the remaining pages with get_user_pages */
>         start += nr << PAGE_SHIFT;
>         pages += nr;
> -       down_read(&mm->mmap_sem);
> -       ret = get_user_pages(current, mm, start,
> -                            nr_pages - nr, write, 0, pages, NULL);
> -       up_read(&mm->mmap_sem);
> +       ret = get_user_pages_unlocked(current, mm, start,
> +                            nr_pages - nr, write, 0, pages);
>         /* Have to be a bit careful with return values */
>         if (nr > 0)
>                 ret = (ret < 0) ? nr : ret + nr;
> diff --git a/arch/sh/mm/gup.c b/arch/sh/mm/gup.c
> index 37458f3..e15f52a 100644
> --- a/arch/sh/mm/gup.c
> +++ b/arch/sh/mm/gup.c
> @@ -257,10 +257,8 @@ slow_irqon:
>                 start += nr << PAGE_SHIFT;
>                 pages += nr;
>
> -               down_read(&mm->mmap_sem);
> -               ret = get_user_pages(current, mm, start,
> -                       (end - start) >> PAGE_SHIFT, write, 0, pages, NULL);
> -               up_read(&mm->mmap_sem);
> +               ret = get_user_pages_unlocked(current, mm, start,
> +                       (end - start) >> PAGE_SHIFT, write, 0, pages);
>
>                 /* Have to be a bit careful with return values */
>                 if (nr > 0) {
> diff --git a/arch/sparc/mm/gup.c b/arch/sparc/mm/gup.c
> index 1aed043..fa7de7d 100644
> --- a/arch/sparc/mm/gup.c
> +++ b/arch/sparc/mm/gup.c
> @@ -219,10 +219,8 @@ slow:
>                 start += nr << PAGE_SHIFT;
>                 pages += nr;
>
> -               down_read(&mm->mmap_sem);
> -               ret = get_user_pages(current, mm, start,
> -                       (end - start) >> PAGE_SHIFT, write, 0, pages, NULL);
> -               up_read(&mm->mmap_sem);
> +               ret = get_user_pages_unlocked(current, mm, start,
> +                       (end - start) >> PAGE_SHIFT, write, 0, pages);
>
>                 /* Have to be a bit careful with return values */
>                 if (nr > 0) {
> diff --git a/arch/x86/mm/gup.c b/arch/x86/mm/gup.c
> index 207d9aef..2ab183b 100644
> --- a/arch/x86/mm/gup.c
> +++ b/arch/x86/mm/gup.c
> @@ -388,10 +388,9 @@ slow_irqon:
>                 start += nr << PAGE_SHIFT;
>                 pages += nr;
>
> -               down_read(&mm->mmap_sem);
> -               ret = get_user_pages(current, mm, start,
> -                       (end - start) >> PAGE_SHIFT, write, 0, pages, NULL);
> -               up_read(&mm->mmap_sem);
> +               ret = get_user_pages_unlocked(current, mm, start,
> +                                             (end - start) >> PAGE_SHIFT,
> +                                             write, 0, pages);
>
>                 /* Have to be a bit careful with return values */
>                 if (nr > 0) {
> diff --git a/drivers/dma/iovlock.c b/drivers/dma/iovlock.c
> index bb48a57..12ea7c3 100644
> --- a/drivers/dma/iovlock.c
> +++ b/drivers/dma/iovlock.c
> @@ -95,17 +95,11 @@ struct dma_pinned_list *dma_pin_iovec_pages(struct iovec *iov, size_t len)
>                 pages += page_list->nr_pages;
>
>                 /* pin pages down */
> -               down_read(&current->mm->mmap_sem);
> -               ret = get_user_pages(
> -                       current,
> -                       current->mm,
> +               ret = get_user_pages_fast(
>                         (unsigned long) iov[i].iov_base,
>                         page_list->nr_pages,
>                         1,      /* write */
> -                       0,      /* force */
> -                       page_list->pages,
> -                       NULL);
> -               up_read(&current->mm->mmap_sem);
> +                       page_list->pages);
>
>                 if (ret != page_list->nr_pages)
>                         goto unpin;
> diff --git a/drivers/iommu/amd_iommu_v2.c b/drivers/iommu/amd_iommu_v2.c
> index 5f578e8..6963b73 100644
> --- a/drivers/iommu/amd_iommu_v2.c
> +++ b/drivers/iommu/amd_iommu_v2.c
> @@ -519,10 +519,8 @@ static void do_fault(struct work_struct *work)
>
>         write = !!(fault->flags & PPR_FAULT_WRITE);
>
> -       down_read(&fault->state->mm->mmap_sem);
> -       npages = get_user_pages(NULL, fault->state->mm,
> -                               fault->address, 1, write, 0, &page, NULL);
> -       up_read(&fault->state->mm->mmap_sem);
> +       npages = get_user_pages_unlocked(NULL, fault->state->mm,
> +                                        fault->address, 1, write, 0, &page);
>
>         if (npages == 1) {
>                 put_page(page);
> diff --git a/drivers/media/pci/ivtv/ivtv-udma.c b/drivers/media/pci/ivtv/ivtv-udma.c
> index 7338cb2..96d866b 100644
> --- a/drivers/media/pci/ivtv/ivtv-udma.c
> +++ b/drivers/media/pci/ivtv/ivtv-udma.c
> @@ -124,10 +124,8 @@ int ivtv_udma_setup(struct ivtv *itv, unsigned long ivtv_dest_addr,
>         }
>
>         /* Get user pages for DMA Xfer */
> -       down_read(&current->mm->mmap_sem);
> -       err = get_user_pages(current, current->mm,
> -                       user_dma.uaddr, user_dma.page_count, 0, 1, dma->map, NULL);
> -       up_read(&current->mm->mmap_sem);
> +       err = get_user_pages_unlocked(current, current->mm,
> +                       user_dma.uaddr, user_dma.page_count, 0, 1, dma->map);
>
>         if (user_dma.page_count != err) {
>                 IVTV_DEBUG_WARN("failed to map user pages, returned %d instead of %d\n",
> diff --git a/drivers/misc/sgi-gru/grufault.c b/drivers/misc/sgi-gru/grufault.c
> index f74fc0c..cd20669 100644
> --- a/drivers/misc/sgi-gru/grufault.c
> +++ b/drivers/misc/sgi-gru/grufault.c
> @@ -198,8 +198,7 @@ static int non_atomic_pte_lookup(struct vm_area_struct *vma,
>  #else
>         *pageshift = PAGE_SHIFT;
>  #endif
> -       if (get_user_pages
> -           (current, current->mm, vaddr, 1, write, 0, &page, NULL) <= 0)
> +       if (get_user_pages_fast(vaddr, 1, write, &page) <= 0)
>                 return -EFAULT;
>         *paddr = page_to_phys(page);
>         put_page(page);
> diff --git a/drivers/scsi/st.c b/drivers/scsi/st.c
> index aff9689..c89dcfa 100644
> --- a/drivers/scsi/st.c
> +++ b/drivers/scsi/st.c
> @@ -4536,18 +4536,12 @@ static int sgl_map_user_pages(struct st_buffer *STbp,
>                 return -ENOMEM;
>
>          /* Try to fault in all of the necessary pages */
> -       down_read(&current->mm->mmap_sem);
>          /* rw==READ means read from drive, write into memory area */
> -       res = get_user_pages(
> -               current,
> -               current->mm,
> +       res = get_user_pages_fast(
>                 uaddr,
>                 nr_pages,
>                 rw == READ,
> -               0, /* don't force */
> -               pages,
> -               NULL);
> -       up_read(&current->mm->mmap_sem);
> +               pages);
>
>         /* Errors and no page mapped should return here */
>         if (res < nr_pages)
> diff --git a/drivers/video/fbdev/pvr2fb.c b/drivers/video/fbdev/pvr2fb.c
> index 167cfff..ff81f65 100644
> --- a/drivers/video/fbdev/pvr2fb.c
> +++ b/drivers/video/fbdev/pvr2fb.c
> @@ -686,10 +686,7 @@ static ssize_t pvr2fb_write(struct fb_info *info, const char *buf,
>         if (!pages)
>                 return -ENOMEM;
>
> -       down_read(&current->mm->mmap_sem);
> -       ret = get_user_pages(current, current->mm, (unsigned long)buf,
> -                            nr_pages, WRITE, 0, pages, NULL);
> -       up_read(&current->mm->mmap_sem);
> +       ret = get_user_pages_fast((unsigned long)buf, nr_pages, WRITE, pages);
>
>         if (ret < nr_pages) {
>                 nr_pages = ret;
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 32ba786..69f692d 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -1197,6 +1197,13 @@ long get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
>                     unsigned long start, unsigned long nr_pages,
>                     int write, int force, struct page **pages,
>                     struct vm_area_struct **vmas);
> +long get_user_pages_locked(struct task_struct *tsk, struct mm_struct *mm,
> +                   unsigned long start, unsigned long nr_pages,
> +                   int write, int force, struct page **pages,
> +                   int *locked);
> +long get_user_pages_unlocked(struct task_struct *tsk, struct mm_struct *mm,
> +                   unsigned long start, unsigned long nr_pages,
> +                   int write, int force, struct page **pages);
>  int get_user_pages_fast(unsigned long start, int nr_pages, int write,
>                         struct page **pages);
>  struct kvec;
> diff --git a/mm/gup.c b/mm/gup.c
> index 91d044b..19e17ab 100644
> --- a/mm/gup.c
> +++ b/mm/gup.c
> @@ -576,6 +576,134 @@ int fixup_user_fault(struct task_struct *tsk, struct mm_struct *mm,
>         return 0;
>  }
>
> +static inline long __get_user_pages_locked(struct task_struct *tsk,
> +                                          struct mm_struct *mm,
> +                                          unsigned long start,
> +                                          unsigned long nr_pages,
> +                                          int write, int force,
> +                                          struct page **pages,
> +                                          struct vm_area_struct **vmas,
> +                                          int *locked,
> +                                          bool immediate_unlock)
s/immediate_unlock/notify_drop/
> +{
> +       int flags = FOLL_TOUCH;
> +       long ret, pages_done;
> +       bool lock_dropped;
s/lock_dropped/sem_dropped/
> +
> +       if (locked) {
> +               /* if VM_FAULT_RETRY can be returned, vmas become invalid */
> +               BUG_ON(vmas);
> +               /* check caller initialized locked */
> +               BUG_ON(*locked != 1);
> +       } else {
> +               /*
> +                * Not really important, the value is irrelevant if
> +                * locked is NULL, but BUILD_BUG_ON costs nothing.
> +                */
> +               BUILD_BUG_ON(immediate_unlock);
> +       }
> +
> +       if (pages)
> +               flags |= FOLL_GET;
> +       if (write)
> +               flags |= FOLL_WRITE;
> +       if (force)
> +               flags |= FOLL_FORCE;
> +
> +       pages_done = 0;
> +       lock_dropped = false;
> +       for (;;) {
> +               ret = __get_user_pages(tsk, mm, start, nr_pages, flags, pages,
> +                                      vmas, locked);
> +               if (!locked)
> +                       /* VM_FAULT_RETRY couldn't trigger, bypass */
> +                       return ret;
> +
> +               /* VM_FAULT_RETRY cannot return errors */
> +               if (!*locked) {

Set lock_dropped = 1. In case we break out too soon (which we do if
nr_pages drops to zero a couple lines below) and report a stale value.

> +                       BUG_ON(ret < 0);
> +                       BUG_ON(nr_pages == 1 && ret);
> +               }
> +
> +               if (!pages)
> +                       /* If it's a prefault don't insist harder */
> +                       return ret;
> +
> +               if (ret > 0) {
> +                       nr_pages -= ret;
> +                       pages_done += ret;
> +                       if (!nr_pages)
> +                               break;
> +               }
> +               if (*locked) {
> +                       /* VM_FAULT_RETRY didn't trigger */
> +                       if (!pages_done)
> +                               pages_done = ret;

Replace top two lines with
if (ret >0)
    pages_done += ret;

> +                       break;
> +               }
> +               /* VM_FAULT_RETRY triggered, so seek to the faulting offset */
> +               pages += ret;
> +               start += ret << PAGE_SHIFT;
> +
> +               /*
> +                * Repeat on the address that fired VM_FAULT_RETRY
> +                * without FAULT_FLAG_ALLOW_RETRY but with
> +                * FAULT_FLAG_TRIED.
> +                */
> +               *locked = 1;
> +               lock_dropped = true;

Not really needed if set where previously suggested.

> +               down_read(&mm->mmap_sem);
> +               ret = __get_user_pages(tsk, mm, start, nr_pages, flags | FOLL_TRIED,
> +                                      pages, NULL, NULL);

s/nr_pages/1/ otherwise we block on everything left ahead, not just
the one that fired RETRY.

> +               if (ret != 1) {
> +                       BUG_ON(ret > 1);

Can ret ever be zero here with count == 1? (ENOENT for a stack guard
page TTBOMK, but what the heck are we doing gup'ing stacks. Suggest
fixing that one case inside __gup impl so count == 1 never returns
zero)

> +                       if (!pages_done)
> +                               pages_done = ret;

Don't think so. ret is --errno at this point (maybe zero). So remove.

> +                       break;
> +               }
> +               nr_pages--;
> +               pages_done++;
> +               if (!nr_pages)
> +                       break;
> +               pages++;
> +               start += PAGE_SIZE;
> +       }
> +       if (!immediate_unlock && lock_dropped && *locked) {
> +               /*
> +                * We must let the caller know we temporarily dropped the lock
> +                * and so the critical section protected by it was lost.
> +                */
> +               up_read(&mm->mmap_sem);

With my suggestion of s/immediate_unlock/notify_drop/ this gets a lot
more understandable (IMHO).

> +               *locked = 0;
> +       }
> +       return pages_done;
> +}
> +
> +long get_user_pages_locked(struct task_struct *tsk, struct mm_struct *mm,
> +                          unsigned long start, unsigned long nr_pages,
> +                          int write, int force, struct page **pages,
> +                          int *locked)
> +{
> +       return __get_user_pages_locked(tsk, mm, start, nr_pages, write, force,
> +                                      pages, NULL, locked, false);
> +}
> +EXPORT_SYMBOL(get_user_pages_locked);
> +
> +long get_user_pages_unlocked(struct task_struct *tsk, struct mm_struct *mm,
> +                            unsigned long start, unsigned long nr_pages,
> +                            int write, int force, struct page **pages)
> +{
> +       long ret;
> +       int locked = 1;
> +       down_read(&mm->mmap_sem);
> +       ret = __get_user_pages_locked(tsk, mm, start, nr_pages, write, force,
> +                                     pages, NULL, &locked, true);
> +       if (locked)
> +               up_read(&mm->mmap_sem);
> +       return ret;
> +}
> +EXPORT_SYMBOL(get_user_pages_unlocked);
> +
>  /*
>   * get_user_pages() - pin user pages in memory
>   * @tsk:       the task_struct to use for page fault accounting, or
> @@ -625,22 +753,19 @@ int fixup_user_fault(struct task_struct *tsk, struct mm_struct *mm,
>   * use the correct cache flushing APIs.
>   *
>   * See also get_user_pages_fast, for performance critical applications.
> + *
> + * get_user_pages should be gradually obsoleted in favor of
> + * get_user_pages_locked|unlocked. Nothing should use get_user_pages
> + * because it cannot pass FAULT_FLAG_ALLOW_RETRY to handle_mm_fault in
> + * turn disabling the userfaultfd feature (after that "inline" can be
> + * cleaned up from get_user_pages_locked).
>   */
>  long get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
>                 unsigned long start, unsigned long nr_pages, int write,
>                 int force, struct page **pages, struct vm_area_struct **vmas)
>  {
> -       int flags = FOLL_TOUCH;
> -
> -       if (pages)
> -               flags |= FOLL_GET;
> -       if (write)
> -               flags |= FOLL_WRITE;
> -       if (force)
> -               flags |= FOLL_FORCE;
> -
> -       return __get_user_pages(tsk, mm, start, nr_pages, flags, pages, vmas,
> -                               NULL);
> +       return __get_user_pages_locked(tsk, mm, start, nr_pages, write, force,
> +                                      pages, vmas, NULL, false);
>  }
>  EXPORT_SYMBOL(get_user_pages);

*Or*, forget about gup_locked and just leave gup as proposed in this
patch. Then gup_unlocked (again IMHO) becomes more meaningful ... "Ah,
that's the one I call when I have no locks taken".

>
> diff --git a/mm/mempolicy.c b/mm/mempolicy.c
> index 8f5330d..6606c10 100644
> --- a/mm/mempolicy.c
> +++ b/mm/mempolicy.c
> @@ -881,7 +881,7 @@ static int lookup_node(struct mm_struct *mm, unsigned long addr)
>         struct page *p;
>         int err;
>
> -       err = get_user_pages(current, mm, addr & PAGE_MASK, 1, 0, 0, &p, NULL);
> +       err = get_user_pages_fast(addr & PAGE_MASK, 1, 0, &p);
>         if (err >= 0) {
>                 err = page_to_nid(p);
>                 put_page(p);
> diff --git a/mm/nommu.c b/mm/nommu.c
> index a881d96..8a06341 100644
> --- a/mm/nommu.c
> +++ b/mm/nommu.c
> @@ -213,6 +213,29 @@ long get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
>  }
>  EXPORT_SYMBOL(get_user_pages);
>
> +long get_user_pages_locked(struct task_struct *tsk, struct mm_struct *mm,
> +                          unsigned long start, unsigned long nr_pages,
> +                          int write, int force, struct page **pages,
> +                          int *locked)
> +{
> +       return get_user_pages(tsk, mm, start, nr_pages, write, force,
> +                             pages, NULL);
> +}
> +EXPORT_SYMBOL(get_user_pages_locked);
> +
> +long get_user_pages_unlocked(struct task_struct *tsk, struct mm_struct *mm,
> +                            unsigned long start, unsigned long nr_pages,
> +                            int write, int force, struct page **pages)
> +{
> +       long ret;
> +       down_read(&mm->mmap_sem);
> +       ret = get_user_pages(tsk, mm, start, nr_pages, write, force,
> +                            pages, NULL);
> +       up_read(&mm->mmap_sem);
> +       return ret;
> +}
> +EXPORT_SYMBOL(get_user_pages_unlocked);
> +
>  /**
>   * follow_pfn - look up PFN at a user virtual address
>   * @vma: memory mapping
> diff --git a/mm/process_vm_access.c b/mm/process_vm_access.c
> index 5077afc..b159769 100644
> --- a/mm/process_vm_access.c
> +++ b/mm/process_vm_access.c
> @@ -99,11 +99,8 @@ static int process_vm_rw_single_vec(unsigned long addr,
>                 size_t bytes;
>
>                 /* Get the pages we're interested in */
> -               down_read(&mm->mmap_sem);
> -               pages = get_user_pages(task, mm, pa, pages,
> -                                     vm_write, 0, process_pages, NULL);
> -               up_read(&mm->mmap_sem);
> -
> +               pages = get_user_pages_unlocked(task, mm, pa, pages,
> +                                               vm_write, 0, process_pages);
>                 if (pages <= 0)
>                         return -EFAULT;
>
> diff --git a/mm/util.c b/mm/util.c
> index 093c973..1b93f2d 100644
> --- a/mm/util.c
> +++ b/mm/util.c
> @@ -247,14 +247,8 @@ int __weak get_user_pages_fast(unsigned long start,
>                                 int nr_pages, int write, struct page **pages)
>  {
>         struct mm_struct *mm = current->mm;
> -       int ret;
> -
> -       down_read(&mm->mmap_sem);
> -       ret = get_user_pages(current, mm, start, nr_pages,
> -                                       write, 0, pages, NULL);
> -       up_read(&mm->mmap_sem);
> -
> -       return ret;
> +       return get_user_pages_unlocked(current, mm, start, nr_pages,
> +                                      write, 0, pages);
>  }
>  EXPORT_SYMBOL_GPL(get_user_pages_fast);
>
> diff --git a/net/ceph/pagevec.c b/net/ceph/pagevec.c
> index 5550130..5504783 100644
> --- a/net/ceph/pagevec.c
> +++ b/net/ceph/pagevec.c
> @@ -23,17 +23,16 @@ struct page **ceph_get_direct_page_vector(const void __user *data,
>         if (!pages)
>                 return ERR_PTR(-ENOMEM);
>
> -       down_read(&current->mm->mmap_sem);
>         while (got < num_pages) {
> -               rc = get_user_pages(current, current->mm,
> -                   (unsigned long)data + ((unsigned long)got * PAGE_SIZE),
> -                   num_pages - got, write_page, 0, pages + got, NULL);
> +               rc = get_user_pages_fast((unsigned long)data +
> +                                        ((unsigned long)got * PAGE_SIZE),
> +                                        num_pages - got,
> +                                        write_page, pages + got);
>                 if (rc < 0)
>                         break;
>                 BUG_ON(rc == 0);
>                 got += rc;
>         }
> -       up_read(&current->mm->mmap_sem);
>         if (rc < 0)
>                 goto fail;
>         return pages;
>
>
> Then to make an example your patch would have become:
>
> ===
> From 74d88763cde285354fb78806ffb332030d1f0739 Mon Sep 17 00:00:00 2001
> From: Andres Lagar-Cavilla <andreslc@xxxxxxxxxx>
> Date: Fri, 26 Sep 2014 18:36:56 +0200
> Subject: [PATCH 2/2] kvm: Faults which trigger IO release the mmap_sem
> MIME-Version: 1.0
> Content-Type: text/plain; charset=UTF-8
> Content-Transfer-Encoding: 8bit
>
> When KVM handles a tdp fault it uses FOLL_NOWAIT. If the guest memory
> has been swapped out or is behind a filemap, this will trigger async
> readahead and return immediately. The rationale is that KVM will kick
> back the guest with an "async page fault" and allow for some other
> guest process to take over.
>
> If async PFs are enabled the fault is retried asap from an async
> workqueue. If not, it's retried immediately in the same code path. In
> either case the retry will not relinquish the mmap semaphore and will
> block on the IO. This is a bad thing, as other mmap semaphore users
> now stall as a function of swap or filemap latency.
>
> This patch ensures both the regular and async PF path re-enter the
> fault allowing for the mmap semaphore to be relinquished in the case
> of IO wait.
>
> Reviewed-by: Radim Krčmář <rkrcmar@xxxxxxxxxx>
> Signed-off-by: Andres Lagar-Cavilla <andreslc@xxxxxxxxxx>
> Signed-off-by: Andrea Arcangeli <aarcange@xxxxxxxxxx>
> ---
>  include/linux/mm.h  | 1 +
>  mm/gup.c            | 4 ++++
>  virt/kvm/async_pf.c | 4 +---
>  virt/kvm/kvm_main.c | 4 ++--
>  4 files changed, 8 insertions(+), 5 deletions(-)
>
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 69f692d..71dbe03 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -1997,6 +1997,7 @@ static inline struct page *follow_page(struct vm_area_struct *vma,
>  #define FOLL_HWPOISON  0x100   /* check page is hwpoisoned */
>  #define FOLL_NUMA      0x200   /* force NUMA hinting page fault */
>  #define FOLL_MIGRATION 0x400   /* wait for page to replace migration entry */
> +#define FOLL_TRIED     0x800   /* a retry, previous pass started an IO */
>
>  typedef int (*pte_fn_t)(pte_t *pte, pgtable_t token, unsigned long addr,
>                         void *data);
> diff --git a/mm/gup.c b/mm/gup.c
> index 19e17ab..369b3f6 100644
> --- a/mm/gup.c
> +++ b/mm/gup.c
> @@ -281,6 +281,10 @@ static int faultin_page(struct task_struct *tsk, struct vm_area_struct *vma,
>                 fault_flags |= FAULT_FLAG_ALLOW_RETRY;
>         if (*flags & FOLL_NOWAIT)
>                 fault_flags |= FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_RETRY_NOWAIT;
> +       if (*flags & FOLL_TRIED) {
> +               VM_WARN_ON_ONCE(fault_flags & FAULT_FLAG_ALLOW_RETRY);
> +               fault_flags |= FAULT_FLAG_TRIED;
> +       }
>
>         ret = handle_mm_fault(mm, vma, address, fault_flags);
>         if (ret & VM_FAULT_ERROR) {
> diff --git a/virt/kvm/async_pf.c b/virt/kvm/async_pf.c
> index d6a3d09..44660ae 100644
> --- a/virt/kvm/async_pf.c
> +++ b/virt/kvm/async_pf.c
> @@ -80,9 +80,7 @@ static void async_pf_execute(struct work_struct *work)
>
>         might_sleep();
>
> -       down_read(&mm->mmap_sem);
> -       get_user_pages(NULL, mm, addr, 1, 1, 0, NULL, NULL);
> -       up_read(&mm->mmap_sem);
> +       get_user_pages_unlocked(NULL, mm, addr, 1, 1, 0, NULL);
>         kvm_async_page_present_sync(vcpu, apf);
>
>         spin_lock(&vcpu->async_pf.lock);
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index 95519bc..921bce7 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -1170,8 +1170,8 @@ static int hva_to_pfn_slow(unsigned long addr, bool *async, bool write_fault,
>                                               addr, write_fault, page);
>                 up_read(&current->mm->mmap_sem);
>         } else
> -               npages = get_user_pages_fast(addr, 1, write_fault,
> -                                            page);
> +               npages = get_user_pages_unlocked(current, current->mm, addr, 1,
> +                                                write_fault, 0, page);
>         if (npages != 1)
>                 return npages;

Acked, for the spirit. Likely my patch will go in and then you can
just throw this one on top, removing kvm_get_user_page_io in the
process.

>
>
>
> This isn't bisectable in this order and it's untested anyway. It needs
> more patchsplits.
>
> This is just an initial RFC to know if it's ok to go into this
> direction.
>
> If it's ok I'll do some testing and submit it more properly. If your
> patches goes in first it's fine and I'll just replace the call in KVM
> to get_user_pages_unlocked (or whatever we want to call that thing).
>
> I'd need to get this (or equivalent solution) merged before
> re-submitting the userfaultfd patchset. I think the above benefits the
> kernel as a whole in terms of mmap_sem holdtimes regardless of
> userfaultfd so it should be good.
>
>> Well, IIUC every code path that has ALLOW_RETRY dives in the second
>> time with FAULT_TRIED or similar. In the common case, you happily
>> blaze through the second time, but if someone raced in while all locks
>> were given up, one pays the price of the second time being a full
>> fault hogging the mmap sem. At some point you need to not keep being
>> polite otherwise the task starves. Presumably the risk of an extra
>> retry drops steeply every new gup retry. Maybe just try three times is
>> good enough. It makes for ugly logic though.
>
> I was under the idea that if one looped forever with VM_FAULT_RETRY
> it'd eventually succeed, but it risks doing more work, so I'm also
> sticking to the "locked != NULL" first, seek to the first page that
> returned VM_FAULT_RETRY and issue a nr_pages=1 gup with locked ==
> NULL, and then continue with locked != NULL at the next page. Just
> like you did in the KVM slow path. And if "pages" array is NULL I bail
> out at the first VM_FAULT_RETRY failure without insisting with further
> gup calls of the "&locked" kind, your patch had just 1 page but you
> also bailed out.
>
> What this code above does is basically to generalize your optimization
> to KVM and it makes it global and at the same time it avoids me
> trouble in handle_userfault().
>
> While at it I also converted some obvious candidate for gup_fast that
> had no point in running slower (which I should split off in a separate
> patch).

Yes to all.

The part that I'm missing is how would MADV_USERFAULT handle this. It
would be buried in faultin_page, if no RETRY possible raise sigbus,
otherwise drop the mmap semaphore and signal and sleep on the
userfaultfd?

Thanks,
Andres

>
> Thanks,
> Andrea

-- 
Andres Lagar-Cavilla | Google Kernel Team | andreslc@xxxxxxxxxx

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@xxxxxxxxx.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href