On Fri, 26 Sep 2014, Steve Capper wrote:

> get_user_pages_fast attempts to pin user pages by walking the page
> tables directly and avoids taking locks. Thus the walker needs to be
> protected from page table pages being freed from under it, and needs
> to block any THP splits.
>
> One way to achieve this is to have the walker disable interrupts, and
> rely on IPIs from the TLB flushing code blocking before the page table
> pages are freed.
>
> On some platforms we have hardware broadcast of TLB invalidations, thus
> the TLB flushing code doesn't necessarily need to broadcast IPIs; and
> spuriously broadcasting IPIs can hurt system performance if done too
> often.
>
> This problem has been solved on PowerPC and Sparc by batching up page
> table pages belonging to more than one mm_user, then scheduling an
> rcu_sched callback to free the pages. This RCU page table free logic
> has been promoted to core code and is activated when one enables
> HAVE_RCU_TABLE_FREE. Unfortunately, these architectures implement
> their own get_user_pages_fast routines.
>
> The RCU page table free logic, coupled with an IPI broadcast on THP
> split (which is a rare event), allows one to protect a page table
> walker by merely disabling interrupts during the walk.
>
> This patch provides a general RCU implementation of get_user_pages_fast
> that can be used by architectures that perform hardware broadcast of
> TLB invalidations.
>
> It is based heavily on the PowerPC implementation by Nick Piggin.
>
> Signed-off-by: Steve Capper <steve.capper@xxxxxxxxxx>
> Tested-by: Dann Frazier <dann.frazier@xxxxxxxxxxxxx>
> Reviewed-by: Catalin Marinas <catalin.marinas@xxxxxxx>

Acked-by: Hugh Dickins <hughd@xxxxxxxxxx>

Thanks for making all those clarifications, Steve: this looks very good
to me now.

I'm not sure which tree you're hoping will take this and the arm+arm64
patches 2-6: although this one would normally go through akpm, I expect
it's easier for you to synchronize if it goes in along with the arm+arm64
2-6 - would that be okay with you, Andrew? I see no clash with what's
currently in mmotm.
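For anyone following along with a view to enabling this on another
architecture: the "THP splits broadcast an IPI" requirement comes down
to overriding pmdp_splitting_flush. Quoting the arm64 side of this
series loosely and from memory (so treat this as a sketch rather than
the patch itself), the hook amounts to:

void pmdp_splitting_flush(struct vm_area_struct *vma, unsigned long address,
			  pmd_t *pmdp)
{
	pmd_t pmd = pmd_mksplitting(*pmdp);

	VM_BUG_ON(address & ~PMD_MASK);
	set_pmd_at(vma->vm_mm, address, pmdp, pmd);

	/* dummy IPI to serialise against fast_gup */
	kick_all_cpus_sync();
}

kick_all_cpus_sync() broadcasts to every CPU and waits, so a fast_gup
walker running with interrupts disabled must have left the pmd before
the split proceeds.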
> ---
> Changed in V4:
>  * Added pte_numa and pmd_numa calls.
>  * Added comments to clarify what assumptions are being made by the
>    implementation.
>  * Cleaned up formatting for checkpatch.
>
> Catalin, I've kept your Reviewed-by, please shout if you dislike the
> pte_numa and pmd_numa calls.
> ---
>  mm/Kconfig |   3 +
>  mm/gup.c   | 354 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>  2 files changed, 357 insertions(+)
>
> diff --git a/mm/Kconfig b/mm/Kconfig
> index 886db21..0ceb8a5 100644
> --- a/mm/Kconfig
> +++ b/mm/Kconfig
> @@ -137,6 +137,9 @@ config HAVE_MEMBLOCK_NODE_MAP
>  config HAVE_MEMBLOCK_PHYS_MAP
>  	boolean
>
> +config HAVE_GENERIC_RCU_GUP
> +	boolean
> +
>  config ARCH_DISCARD_MEMBLOCK
>  	boolean
>
> diff --git a/mm/gup.c b/mm/gup.c
> index 91d044b..35c0160 100644
> --- a/mm/gup.c
> +++ b/mm/gup.c
> @@ -10,6 +10,10 @@
>  #include <linux/swap.h>
>  #include <linux/swapops.h>
>
> +#include <linux/sched.h>
> +#include <linux/rwsem.h>
> +#include <asm/pgtable.h>
> +
>  #include "internal.h"
>
>  static struct page *no_page_table(struct vm_area_struct *vma,
> @@ -672,3 +676,353 @@ struct page *get_dump_page(unsigned long addr)
>  	return page;
>  }
>  #endif /* CONFIG_ELF_CORE */
> +
> +/**
> + * Generic RCU Fast GUP
> + *
> + * get_user_pages_fast attempts to pin user pages by walking the page
> + * tables directly and avoids taking locks. Thus the walker needs to be
> + * protected from page table pages being freed from under it, and should
> + * block any THP splits.
> + *
> + * One way to achieve this is to have the walker disable interrupts, and
> + * rely on IPIs from the TLB flushing code blocking before the page table
> + * pages are freed. This is unsuitable for architectures that do not need
> + * to broadcast an IPI when invalidating TLBs.
> + *
> + * Another way to achieve this is to batch up page table pages belonging
> + * to more than one mm_user, then schedule an rcu_sched callback to free
> + * those pages. Disabling interrupts will allow the fast_gup walker to
> + * both block the rcu_sched callback, and an IPI that we broadcast for
> + * splitting THPs (which is a relatively rare event). The code below
> + * adopts this strategy.
> + *
> + * Before activating this code, please be aware that the following
> + * assumptions are currently made:
> + *
> + *  *) HAVE_RCU_TABLE_FREE is enabled, and tlb_remove_table is used to free
> + *     pages containing page tables.
> + *
> + *  *) THP splits will broadcast an IPI; this can be achieved by overriding
> + *     pmdp_splitting_flush.
> + *
> + *  *) ptes can be read atomically by the architecture.
> + *
> + *  *) access_ok is sufficient to validate userspace address ranges.
> + *
> + * The last two assumptions can be relaxed by the addition of helper
> + * functions.
> + *
> + * This code is based heavily on the PowerPC implementation by Nick Piggin.
> + */
> +#ifdef CONFIG_HAVE_GENERIC_RCU_GUP
> +
> +#ifdef __HAVE_ARCH_PTE_SPECIAL
> +static int gup_pte_range(pmd_t pmd, unsigned long addr, unsigned long end,
> +			 int write, struct page **pages, int *nr)
> +{
> +	pte_t *ptep, *ptem;
> +	int ret = 0;
> +
> +	ptem = ptep = pte_offset_map(&pmd, addr);
> +	do {
> +		/*
> +		 * In the line below we are assuming that the pte can be read
> +		 * atomically. If this is not the case for your architecture,
> +		 * please wrap this in a helper function!
> +		 *
> +		 * For an example see gup_get_pte in arch/x86/mm/gup.c.
> +		 */
> +		pte_t pte = ACCESS_ONCE(*ptep);
> +		struct page *page;
> +
> +		/*
> +		 * Similar to the PMD case below, NUMA hinting must take the
> +		 * slow path.
> +		 */
> +		if (!pte_present(pte) || pte_special(pte) ||
> +			pte_numa(pte) || (write && !pte_write(pte)))
> +			goto pte_unmap;
> +
> +		VM_BUG_ON(!pfn_valid(pte_pfn(pte)));
> +		page = pte_page(pte);
> +
> +		if (!page_cache_get_speculative(page))
> +			goto pte_unmap;
> +
> +		if (unlikely(pte_val(pte) != pte_val(*ptep))) {
> +			put_page(page);
> +			goto pte_unmap;
> +		}
> +
> +		pages[*nr] = page;
> +		(*nr)++;
> +
> +	} while (ptep++, addr += PAGE_SIZE, addr != end);
> +
> +	ret = 1;
> +
> +pte_unmap:
> +	pte_unmap(ptem);
> +	return ret;
> +}
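The helper that comment asks for, on an architecture whose ptes are
wider than its atomic read width, would look roughly like the x86 PAE
version of gup_get_pte -- reproduced here from memory as a sketch
(pte_low/pte_high are x86-specific field names):

static inline pte_t gup_get_pte(pte_t *ptep)
{
	pte_t pte;

	/*
	 * Read the two halves, with smp_rmb() ordering the loads, and
	 * retry if the low half changed underneath us.
	 */
	do {
		pte.pte_low = ptep->pte_low;
		smp_rmb();
		pte.pte_high = ptep->pte_high;
		smp_rmb();
	} while (unlikely(pte.pte_low != ptep->pte_low));

	return pte;
}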
> +#else
> +
> +/*
> + * If we can't determine whether or not a pte is special, then fail
> + * immediately for ptes. Note, we can still pin HugeTLB and THP as these
> + * are guaranteed not to be special.
> + *
> + * For a futex to be placed on a THP tail page, get_futex_key requires a
> + * __get_user_pages_fast implementation that can pin pages. Thus it's still
> + * useful to have gup_huge_pmd even if we can't operate on ptes.
> + */
> +static int gup_pte_range(pmd_t pmd, unsigned long addr, unsigned long end,
> +			 int write, struct page **pages, int *nr)
> +{
> +	return 0;
> +}
> +#endif /* __HAVE_ARCH_PTE_SPECIAL */
> +
> +static int gup_huge_pmd(pmd_t orig, pmd_t *pmdp, unsigned long addr,
> +		unsigned long end, int write, struct page **pages, int *nr)
> +{
> +	struct page *head, *page, *tail;
> +	int refs;
> +
> +	if (write && !pmd_write(orig))
> +		return 0;
> +
> +	refs = 0;
> +	head = pmd_page(orig);
> +	page = head + ((addr & ~PMD_MASK) >> PAGE_SHIFT);
> +	tail = page;
> +	do {
> +		VM_BUG_ON_PAGE(compound_head(page) != head, page);
> +		pages[*nr] = page;
> +		(*nr)++;
> +		page++;
> +		refs++;
> +	} while (addr += PAGE_SIZE, addr != end);
> +
> +	if (!page_cache_add_speculative(head, refs)) {
> +		*nr -= refs;
> +		return 0;
> +	}
> +
> +	if (unlikely(pmd_val(orig) != pmd_val(*pmdp))) {
> +		*nr -= refs;
> +		while (refs--)
> +			put_page(head);
> +		return 0;
> +	}
> +
> +	/*
> +	 * Any tail pages need their mapcount reference taken before we
> +	 * return. (This allows the THP code to bump their ref count when
> +	 * they are split into base pages).
> +	 */
> +	while (refs--) {
> +		if (PageTail(tail))
> +			get_huge_page_tail(tail);
> +		tail++;
> +	}
> +
> +	return 1;
> +}
> +
> +static int gup_huge_pud(pud_t orig, pud_t *pudp, unsigned long addr,
> +		unsigned long end, int write, struct page **pages, int *nr)
> +{
> +	struct page *head, *page, *tail;
> +	int refs;
> +
> +	if (write && !pud_write(orig))
> +		return 0;
> +
> +	refs = 0;
> +	head = pud_page(orig);
> +	page = head + ((addr & ~PUD_MASK) >> PAGE_SHIFT);
> +	tail = page;
> +	do {
> +		VM_BUG_ON_PAGE(compound_head(page) != head, page);
> +		pages[*nr] = page;
> +		(*nr)++;
> +		page++;
> +		refs++;
> +	} while (addr += PAGE_SIZE, addr != end);
> +
> +	if (!page_cache_add_speculative(head, refs)) {
> +		*nr -= refs;
> +		return 0;
> +	}
> +
> +	if (unlikely(pud_val(orig) != pud_val(*pudp))) {
> +		*nr -= refs;
> +		while (refs--)
> +			put_page(head);
> +		return 0;
> +	}
> +
> +	while (refs--) {
> +		if (PageTail(tail))
> +			get_huge_page_tail(tail);
> +		tail++;
> +	}
> +
> +	return 1;
> +}
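gup_huge_pmd and gup_huge_pud are the same speculative-pin dance at two
levels of the table. Distilled, and with a name of my own invention
(pin_head_speculative is not in the patch, it is purely illustrative),
the shared pattern is:

/* Illustrative only -- not part of the patch. */
static int pin_head_speculative(struct page *head, int refs,
				unsigned long orig, unsigned long *entryp)
{
	/* may fail if the refcount has already fallen to zero */
	if (!page_cache_add_speculative(head, refs))
		return 0;

	/* did the entry change while we were taking references? */
	if (unlikely(orig != ACCESS_ONCE(*entryp))) {
		while (refs--)
			put_page(head);
		return 0;
	}

	return 1;
}

Failing either check just returns 0, which drops us out of the fast
walk and into get_user_pages_fast's slow-path fallback.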
> +static int gup_pmd_range(pud_t pud, unsigned long addr, unsigned long end,
> +		int write, struct page **pages, int *nr)
> +{
> +	unsigned long next;
> +	pmd_t *pmdp;
> +
> +	pmdp = pmd_offset(&pud, addr);
> +	do {
> +		pmd_t pmd = ACCESS_ONCE(*pmdp);
> +
> +		next = pmd_addr_end(addr, end);
> +		if (pmd_none(pmd) || pmd_trans_splitting(pmd))
> +			return 0;
> +
> +		if (unlikely(pmd_trans_huge(pmd) || pmd_huge(pmd))) {
> +			/*
> +			 * NUMA hinting faults need to be handled in the GUP
> +			 * slowpath for accounting purposes and so that they
> +			 * can be serialised against THP migration.
> +			 */
> +			if (pmd_numa(pmd))
> +				return 0;
> +
> +			if (!gup_huge_pmd(pmd, pmdp, addr, next, write,
> +				pages, nr))
> +				return 0;
> +
> +		} else if (!gup_pte_range(pmd, addr, next, write, pages, nr))
> +			return 0;
> +	} while (pmdp++, addr = next, addr != end);
> +
> +	return 1;
> +}
> +
> +static int gup_pud_range(pgd_t *pgdp, unsigned long addr, unsigned long end,
> +			 int write, struct page **pages, int *nr)
> +{
> +	unsigned long next;
> +	pud_t *pudp;
> +
> +	pudp = pud_offset(pgdp, addr);
> +	do {
> +		pud_t pud = ACCESS_ONCE(*pudp);
> +
> +		next = pud_addr_end(addr, end);
> +		if (pud_none(pud))
> +			return 0;
> +		if (pud_huge(pud)) {
> +			if (!gup_huge_pud(pud, pudp, addr, next, write,
> +					pages, nr))
> +				return 0;
> +		} else if (!gup_pmd_range(pud, addr, next, write, pages, nr))
> +			return 0;
> +	} while (pudp++, addr = next, addr != end);
> +
> +	return 1;
> +}
> +
> +/*
> + * Like get_user_pages_fast() except it's IRQ-safe in that it won't fall
> + * back to the regular GUP. It will only return non-negative values.
> + */
> +int __get_user_pages_fast(unsigned long start, int nr_pages, int write,
> +			  struct page **pages)
> +{
> +	struct mm_struct *mm = current->mm;
> +	unsigned long addr, len, end;
> +	unsigned long next, flags;
> +	pgd_t *pgdp;
> +	int nr = 0;
> +
> +	start &= PAGE_MASK;
> +	addr = start;
> +	len = (unsigned long) nr_pages << PAGE_SHIFT;
> +	end = start + len;
> +
> +	if (unlikely(!access_ok(write ? VERIFY_WRITE : VERIFY_READ,
> +					start, len)))
> +		return 0;
> +
> +	/*
> +	 * Disable interrupts. We use the nested form as we can already
> +	 * have interrupts disabled by get_futex_key.
> +	 *
> +	 * With interrupts disabled, we block page table pages from being
> +	 * freed from under us. See mmu_gather_tlb in asm-generic/tlb.h
> +	 * for more details.
> +	 *
> +	 * We do not adopt an rcu_read_lock() here as we also want to
> +	 * block IPIs that come from THPs splitting.
> +	 */
> +
> +	local_irq_save(flags);
> +	pgdp = pgd_offset(mm, addr);
> +	do {
> +		next = pgd_addr_end(addr, end);
> +		if (pgd_none(*pgdp))
> +			break;
> +		else if (!gup_pud_range(pgdp, addr, next, write, pages, &nr))
> +			break;
> +	} while (pgdp++, addr = next, addr != end);
> +	local_irq_restore(flags);
> +
> +	return nr;
> +}
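The get_futex_key caller mentioned above is the reason for the nested
irqsave form. From memory (so again a sketch, not a quotation), its THP
tail-page handling in kernel/futex.c disables interrupts itself before
calling __get_user_pages_fast, along these lines:

	if (unlikely(PageTail(page))) {
		put_page(page);
		/* serialize against __split_huge_page_splitting() */
		local_irq_disable();
		if (likely(__get_user_pages_fast(address, 1, !ro, &page) == 1)) {
			page_head = compound_head(page);
			if (page != page_head) {
				get_page(page_head);
				put_page(page);
			}
			local_irq_enable();
		} else {
			local_irq_enable();
			goto again;
		}
	}

With interrupts already off, the pin either succeeds against a stable
pmd or fails cleanly and the caller retries from the top.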
> +
> +/**
> + * get_user_pages_fast() - pin user pages in memory
> + * @start:	starting user address
> + * @nr_pages:	number of pages from start to pin
> + * @write:	whether pages will be written to
> + * @pages:	array that receives pointers to the pages pinned.
> + *		Should be at least nr_pages long.
> + *
> + * Attempt to pin user pages in memory without taking mm->mmap_sem.
> + * If not successful, it will fall back to taking the lock and
> + * calling get_user_pages().
> + *
> + * Returns number of pages pinned. This may be fewer than the number
> + * requested. If nr_pages is 0 or negative, returns 0. If no pages
> + * were pinned, returns -errno.
> + */
> +int get_user_pages_fast(unsigned long start, int nr_pages, int write,
> +			struct page **pages)
> +{
> +	struct mm_struct *mm = current->mm;
> +	int nr, ret;
> +
> +	start &= PAGE_MASK;
> +	nr = __get_user_pages_fast(start, nr_pages, write, pages);
> +	ret = nr;
> +
> +	if (nr < nr_pages) {
> +		/* Try to get the remaining pages with get_user_pages */
> +		start += nr << PAGE_SHIFT;
> +		pages += nr;
> +
> +		down_read(&mm->mmap_sem);
> +		ret = get_user_pages(current, mm, start,
> +				     nr_pages - nr, write, 0, pages, NULL);
> +		up_read(&mm->mmap_sem);
> +
> +		/* Have to be a bit careful with return values */
> +		if (nr > 0) {
> +			if (ret < 0)
> +				ret = nr;
> +			else
> +				ret += nr;
> +		}
> +	}
> +
> +	return ret;
> +}
> +
> +#endif /* CONFIG_HAVE_GENERIC_RCU_GUP */
> --
> 1.9.3
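One closing note for anyone wiring this up in a driver: the changelog
never shows a caller, so here is a minimal hypothetical one (the name
pin_user_buffer and its error convention are mine, not the patch's):

static int pin_user_buffer(unsigned long uaddr, int nr_pages,
			   struct page **pages)
{
	int i, got;

	/* pin for write; falls back to the slow path internally */
	got = get_user_pages_fast(uaddr, nr_pages, 1, pages);
	if (got == nr_pages)
		return 0;

	/* partial pin: release whatever we did get and report failure */
	for (i = 0; i < got; i++)
		put_page(pages[i]);

	return got < 0 ? got : -EFAULT;
}

Each successfully pinned page must eventually be released with
put_page() (after set_page_dirty_lock() if it was written to).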