Re: [PATCH v3 2/6] mm/hugetlb: take page table lock in follow_huge_(addr|pmd|pud)()

On Thu, 28 Aug 2014, Naoya Horiguchi wrote:

> We have a race condition between move_pages() and freeing hugepages,
> where move_pages() calls follow_page(FOLL_GET) for hugepages internally
> and tries to get its refcount without preventing concurrent freeing.
> This race crashes the kernel, so this patch fixes it by moving FOLL_GET
> code for hugepages into follow_huge_pmd() with taking the page table lock.

You really ought to mention that you are intentionally dropping the
unnecessary check for NULL pte_page() in this patch: we agree on that,
but it does need to be mentioned somewhere in the comment.

> 
> This patch also adds the similar locking to follow_huge_(addr|pud)
> for consistency.
> 
> Here is the reproducer:
> 
>   $ cat movepages.c
>   #include <stdio.h>
>   #include <stdlib.h>
>   #include <numaif.h>
> 
>   #define ADDR_INPUT      0x700000000000UL
>   #define HPS             0x200000
>   #define PS              0x1000
> 
>   int main(int argc, char *argv[]) {
>           int i;
>           int nr_hp = strtol(argv[1], NULL, 0);
>           int nr_p  = nr_hp * HPS / PS;
>           int ret;
>           void **addrs;
>           int *status;
>           int *nodes;
>           pid_t pid;
> 
>           pid = strtol(argv[2], NULL, 0);
>           addrs  = malloc(sizeof(char *) * nr_p + 1);
>           status = malloc(sizeof(char *) * nr_p + 1);
>           nodes  = malloc(sizeof(char *) * nr_p + 1);
> 
>           while (1) {
>                   for (i = 0; i < nr_p; i++) {
>                           addrs[i] = (void *)ADDR_INPUT + i * PS;
>                           nodes[i] = 1;
>                           status[i] = 0;
>                   }
>                   ret = numa_move_pages(pid, nr_p, addrs, nodes, status,
>                                         MPOL_MF_MOVE_ALL);
>                   if (ret == -1)
>                           err("move_pages");
> 
>                   for (i = 0; i < nr_p; i++) {
>                           addrs[i] = (void *)ADDR_INPUT + i * PS;
>                           nodes[i] = 0;
>                           status[i] = 0;
>                   }
>                   ret = numa_move_pages(pid, nr_p, addrs, nodes, status,
>                                         MPOL_MF_MOVE_ALL);
>                   if (ret == -1)
>                           err("move_pages");
>           }
>           return 0;
>   }
> 
>   $ cat hugepage.c
>   #include <stdio.h>
>   #include <sys/mman.h>
>   #include <string.h>
> 
>   #define ADDR_INPUT      0x700000000000UL
>   #define HPS             0x200000
> 
>   int main(int argc, char *argv[]) {
>           int nr_hp = strtol(argv[1], NULL, 0);
>           char *p;
> 
>           while (1) {
>                   p = mmap((void *)ADDR_INPUT, nr_hp * HPS, PROT_READ | PROT_WRITE,
>                            MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
>                   if (p != (void *)ADDR_INPUT) {
>                           perror("mmap");
>                           break;
>                   }
>                   memset(p, 0, nr_hp * HPS);
>                   munmap(p, nr_hp * HPS);
>           }
>   }
> 
>   $ sysctl vm.nr_hugepages=40
>   $ ./hugepage 10 &
>   $ ./movepages 10 $(pgrep -f hugepage)
> 
> Note for stable inclusion:
>   This patch fixes e632a938d914 ("mm: migrate: add hugepage migration code
>   to move_pages()"), so is applicable to -stable kernels which includes it.

Just say
Fixes: e632a938d914 ("mm: migrate: add hugepage migration code to move_pages()")

> 
> ChangeLog v3:
> - remove unnecessary if (page) check
> - check (pmd|pud)_huge again after holding ptl
> - do the same change also on follow_huge_pud()
> - take page table lock also in follow_huge_addr()
> 
> ChangeLog v2:
> - introduce follow_huge_pmd_lock() to do locking in arch-independent code.

ChangeLog vN info belongs below the ---

> 
> Reported-by: Hugh Dickins <hughd@xxxxxxxxxx>
> Signed-off-by: Naoya Horiguchi <n-horiguchi@xxxxxxxxxxxxx>
> Cc: <stable@xxxxxxxxxxxxxxx>  # [3.12+]

No ack to this one yet, I'm afraid.

> ---
>  arch/ia64/mm/hugetlbpage.c    |  9 +++++++--
>  arch/metag/mm/hugetlbpage.c   |  4 ++--
>  arch/powerpc/mm/hugetlbpage.c | 22 +++++++++++-----------
>  include/linux/hugetlb.h       | 12 ++++++------
>  mm/gup.c                      | 25 ++++---------------------
>  mm/hugetlb.c                  | 43 +++++++++++++++++++++++++++++++------------
>  6 files changed, 61 insertions(+), 54 deletions(-)
> 
> diff --git mmotm-2014-08-25-16-52.orig/arch/ia64/mm/hugetlbpage.c mmotm-2014-08-25-16-52/arch/ia64/mm/hugetlbpage.c
> index 52b7604b5215..6170381bf074 100644
> --- mmotm-2014-08-25-16-52.orig/arch/ia64/mm/hugetlbpage.c
> +++ mmotm-2014-08-25-16-52/arch/ia64/mm/hugetlbpage.c
> @@ -91,17 +91,22 @@ int prepare_hugepage_range(struct file *file,
>  
>  struct page *follow_huge_addr(struct mm_struct *mm, unsigned long addr, int write)
>  {
> -	struct page *page;
> +	struct page *page = NULL;
>  	pte_t *ptep;
> +	spinlock_t *ptl;
>  
>  	if (REGION_NUMBER(addr) != RGN_HPAGE)
>  		return ERR_PTR(-EINVAL);
>  
>  	ptep = huge_pte_offset(mm, addr);
> +	ptl = huge_pte_lock(hstate_vma(vma), vma->vm_mm, ptep);

It was a mistake to lump this follow_huge_addr() change in with the
rest: please defer it to your 6/6 (or send the first 5 and leave the
6th for later).

Unless I'm missing something, all you succeed in doing here is break
the build on ia64 and powerpc, by introducing an undeclared "vma"
variable.

There is no point whatever in taking and dropping this lock: the
point was to do the get_page while holding the relevant page table lock,
but you're not doing any get_page, and you still have an "int write"
argument instead of "int flags" to pass down the FOLL_GET flag,
and you still have the BUG_ON(flags & FOLL_GET) in follow_page_mask().

So, please throw these follow_huge_addr() parts out of this patch.
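
(To spell out the pattern this series needs everywhere, in outline -
an untested sketch, with "flags" replacing "write":

	ptl = <lock covering this huge page table entry>;
	spin_lock(ptl);
	if (<entry still present and huge>) {
		page = <page for entry> + <offset within huge page>;
		if (flags & FOLL_GET)
			get_page(page);
	}
	spin_unlock(ptl);
	return page;

the get_page() has to happen before the lock is dropped, or the lock
achieves nothing.)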

>  	if (!ptep || pte_none(*ptep))
> -		return NULL;
> +		goto out;
> +
>  	page = pte_page(*ptep);
>  	page += ((addr & ~HPAGE_MASK) >> PAGE_SHIFT);
> +out:
> +	spin_unlock(ptl);
>  	return page;
>  }
>  int pmd_huge(pmd_t pmd)
> diff --git mmotm-2014-08-25-16-52.orig/arch/metag/mm/hugetlbpage.c mmotm-2014-08-25-16-52/arch/metag/mm/hugetlbpage.c
> index 745081427659..5e96ef096df9 100644
> --- mmotm-2014-08-25-16-52.orig/arch/metag/mm/hugetlbpage.c
> +++ mmotm-2014-08-25-16-52/arch/metag/mm/hugetlbpage.c
> @@ -104,8 +104,8 @@ int pud_huge(pud_t pud)
>  	return 0;
>  }
>  
> -struct page *follow_huge_pmd(struct mm_struct *mm, unsigned long address,
> -			     pmd_t *pmd, int write)
> +struct page *follow_huge_pmd(struct vm_area_struct *vma, unsigned long address,
> +			     pmd_t *pmd, int flags)

Change from "write" to "flags" is good, but I question below whether
we actually need to change from mm to vma in follow_huge_pmd() and
follow_huge_pud().

>  {
>  	return NULL;
>  }
> diff --git mmotm-2014-08-25-16-52.orig/arch/powerpc/mm/hugetlbpage.c mmotm-2014-08-25-16-52/arch/powerpc/mm/hugetlbpage.c
> index 9517a93a315c..1d8854a56309 100644
> --- mmotm-2014-08-25-16-52.orig/arch/powerpc/mm/hugetlbpage.c
> +++ mmotm-2014-08-25-16-52/arch/powerpc/mm/hugetlbpage.c
> @@ -677,38 +677,38 @@ struct page *
>  follow_huge_addr(struct mm_struct *mm, unsigned long address, int write)
>  {
>  	pte_t *ptep;
> -	struct page *page;
> +	struct page *page = ERR_PTR(-EINVAL);
>  	unsigned shift;
>  	unsigned long mask;
> +	spinlock_t *ptl;
>  	/*
>  	 * Transparent hugepages are handled by generic code. We can skip them
>  	 * here.
>  	 */
>  	ptep = find_linux_pte_or_hugepte(mm->pgd, address, &shift);
> -
> +	ptl = huge_pte_lock(hstate_vma(vma), vma->vm_mm, ptep);

As above, you're breaking the build with a lock that serves no purpose
in the current patch.

>  	/* Verify it is a huge page else bail. */
>  	if (!ptep || !shift || pmd_trans_huge(*(pmd_t *)ptep))
> -		return ERR_PTR(-EINVAL);
> +		goto out;
>  
>  	mask = (1UL << shift) - 1;
> -	page = pte_page(*ptep);
> -	if (page)
> -		page += (address & mask) / PAGE_SIZE;
> -
> +	page = pte_page(*ptep) + ((address & mask) >> PAGE_SHIFT);
> +out:
> +	spin_unlock(ptl);
>  	return page;
>  }
>  
>  struct page *
> -follow_huge_pmd(struct mm_struct *mm, unsigned long address,
> -		pmd_t *pmd, int write)
> +follow_huge_pmd(struct vm_area_struct *vma, unsigned long address,
> +		pmd_t *pmd, int flags)
>  {
>  	BUG();
>  	return NULL;
>  }
>  
>  struct page *
> -follow_huge_pud(struct mm_struct *mm, unsigned long address,
> -		pmd_t *pmd, int write)
> +follow_huge_pud(struct vm_area_struct *vma, unsigned long address,
> +		pud_t *pud, int flags)
>  {
>  	BUG();
>  	return NULL;
> diff --git mmotm-2014-08-25-16-52.orig/include/linux/hugetlb.h mmotm-2014-08-25-16-52/include/linux/hugetlb.h
> index 6e6d338641fe..b3200fce07aa 100644
> --- mmotm-2014-08-25-16-52.orig/include/linux/hugetlb.h
> +++ mmotm-2014-08-25-16-52/include/linux/hugetlb.h
> @@ -98,10 +98,10 @@ pte_t *huge_pte_offset(struct mm_struct *mm, unsigned long addr);
>  int huge_pmd_unshare(struct mm_struct *mm, unsigned long *addr, pte_t *ptep);
>  struct page *follow_huge_addr(struct mm_struct *mm, unsigned long address,
>  			      int write);
> -struct page *follow_huge_pmd(struct mm_struct *mm, unsigned long address,
> -				pmd_t *pmd, int write);
> -struct page *follow_huge_pud(struct mm_struct *mm, unsigned long address,
> -				pud_t *pud, int write);
> +struct page *follow_huge_pmd(struct vm_area_struct *vma, unsigned long address,
> +				pmd_t *pmd, int flags);
> +struct page *follow_huge_pud(struct vm_area_struct *vma, unsigned long address,
> +				pud_t *pud, int flags);
>  int pmd_huge(pmd_t pmd);
>  int pud_huge(pud_t pmd);
>  unsigned long hugetlb_change_protection(struct vm_area_struct *vma,
> @@ -133,8 +133,8 @@ static inline void hugetlb_report_meminfo(struct seq_file *m)
>  static inline void hugetlb_show_meminfo(void)
>  {
>  }
> -#define follow_huge_pmd(mm, addr, pmd, write)	NULL
> -#define follow_huge_pud(mm, addr, pud, write)	NULL
> +#define follow_huge_pmd(vma, addr, pmd, flags)	NULL
> +#define follow_huge_pud(vma, addr, pud, flags)	NULL
>  #define prepare_hugepage_range(file, addr, len)	(-EINVAL)
>  #define pmd_huge(x)	0
>  #define pud_huge(x)	0
> diff --git mmotm-2014-08-25-16-52.orig/mm/gup.c mmotm-2014-08-25-16-52/mm/gup.c
> index 91d044b1600d..597a5e92e265 100644
> --- mmotm-2014-08-25-16-52.orig/mm/gup.c
> +++ mmotm-2014-08-25-16-52/mm/gup.c
> @@ -162,33 +162,16 @@ struct page *follow_page_mask(struct vm_area_struct *vma,
>  	pud = pud_offset(pgd, address);
>  	if (pud_none(*pud))
>  		return no_page_table(vma, flags);
> -	if (pud_huge(*pud) && vma->vm_flags & VM_HUGETLB) {
> -		if (flags & FOLL_GET)
> -			return NULL;
> -		page = follow_huge_pud(mm, address, pud, flags & FOLL_WRITE);
> -		return page;
> -	}
> +	if (pud_huge(*pud) && vma->vm_flags & VM_HUGETLB)
> +		return follow_huge_pud(vma, address, pud, flags);

Yes, this part is good, except I think mm rather than vma.

>  	if (unlikely(pud_bad(*pud)))
>  		return no_page_table(vma, flags);
>  
>  	pmd = pmd_offset(pud, address);
>  	if (pmd_none(*pmd))
>  		return no_page_table(vma, flags);
> -	if (pmd_huge(*pmd) && vma->vm_flags & VM_HUGETLB) {
> -		page = follow_huge_pmd(mm, address, pmd, flags & FOLL_WRITE);
> -		if (flags & FOLL_GET) {
> -			/*
> -			 * Refcount on tail pages are not well-defined and
> -			 * shouldn't be taken. The caller should handle a NULL
> -			 * return when trying to follow tail pages.
> -			 */
> -			if (PageHead(page))
> -				get_page(page);
> -			else
> -				page = NULL;
> -		}
> -		return page;
> -	}
> +	if (pmd_huge(*pmd) && vma->vm_flags & VM_HUGETLB)
> +		return follow_huge_pmd(vma, address, pmd, flags);

And this part is good, except I think mm rather than vma.

>  	if ((flags & FOLL_NUMA) && pmd_numa(*pmd))
>  		return no_page_table(vma, flags);
>  	if (pmd_trans_huge(*pmd)) {
> diff --git mmotm-2014-08-25-16-52.orig/mm/hugetlb.c mmotm-2014-08-25-16-52/mm/hugetlb.c
> index 022767506c7b..c5345c5edb50 100644
> --- mmotm-2014-08-25-16-52.orig/mm/hugetlb.c
> +++ mmotm-2014-08-25-16-52/mm/hugetlb.c
> @@ -3667,26 +3667,45 @@ follow_huge_addr(struct mm_struct *mm, unsigned long address,
>  }
>  
>  struct page * __weak
> -follow_huge_pmd(struct mm_struct *mm, unsigned long address,
> -		pmd_t *pmd, int write)
> +follow_huge_pmd(struct vm_area_struct *vma, unsigned long address,
> +		pmd_t *pmd, int flags)
>  {
> -	struct page *page;
> +	struct page *page = NULL;
> +	spinlock_t *ptl;
>  
> -	page = pte_page(*(pte_t *)pmd);
> -	if (page)
> -		page += ((address & ~PMD_MASK) >> PAGE_SHIFT);
> +	ptl = huge_pte_lock(hstate_vma(vma), vma->vm_mm, (pte_t *)pmd);

So, this is why you have had to change from "mm" to "vma" throughout.
And we might end up deciding that that is the right thing to do.

But here we are deep in page table code, dealing with a huge pmd entry:
I protest that it's very lame to be asking vma->vm_file to tell us what
lock the page table code needs at this level.  Isn't it pmd_lockptr()?
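
To illustrate (an untested sketch, assuming pmd_lockptr() is indeed
the right lock at this level), follow_huge_pmd() could then keep its
"mm" argument:

	struct page * __weak
	follow_huge_pmd(struct mm_struct *mm, unsigned long address,
			pmd_t *pmd, int flags)
	{
		struct page *page = NULL;
		spinlock_t *ptl;

		ptl = pmd_lockptr(mm, pmd);
		spin_lock(ptl);
		if (pmd_huge(*pmd)) {
			page = pte_page(*(pte_t *)pmd) +
				((address & ~PMD_MASK) >> PAGE_SHIFT);
			if (flags & FOLL_GET)
				get_page(page);
		}
		spin_unlock(ptl);
		return page;
	}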

Now, I'm easily confused, and there may be reasons why it's more subtle
than that, and you really are forced to use huge_pte_lockptr(); but I'd
much rather not if we can avoid doing so, just as a matter of principle.

One subtlety to take care over: it's a long time since I've had to
worry about pmd folding and pud folding (what happens when you only
have 2 or 3 levels of page table instead of the full 4): macros get
defined to each other, and levels get optimized out (perhaps
differently on different architectures).

So although at first sight the lock to take in follow_huge_pud()
would seem to be mm->page_table_lock, I am not at this point certain
that that's necessarily so - sometimes pud_huge might be pmd_huge,
and the size PMD_SIZE, and pmd_lockptr appropriate at what appears
to be the pud level.  Maybe: needs checking through the architectures
and their configs, not obvious to me.

I realize that I am asking for you (or I) to do more work, when using
huge_pte_lock(hstate_vma(vma),,) would work it out "automatically";
but I do feel quite strongly that that's the right approach here
(and I'm not just trying to avoid a few edits of "mm" to "vma").

Cc'ing Kirill, who may have a strong view to the contrary,
or a good insight on where the problems if any might be.

Also Cc'ing Kirill because I'm not convinced that huge_pte_lockptr()
necessarily does the right thing on follow_huge_addr() architectures,
ia64 and powerpc.  Do they, for example, allocate the memory for their
hugetlb entries in such a way that we can indeed use pmd_lockptr() to
point to a useable spinlock, in the case when huge_page_size(h) just
happens to equal PMD_SIZE?

I don't know if this was thought through thoroughly
(now that's a satisfying phrase hugh thinks hugh never wrote before!)
when huge_pte_lockptr() was invented or not.  I think it would be safer
if huge_pte_lockptr() just gave mm->page_table_lock on follow_huge_addr()
architectures.
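
If I'm reading it right, the USE_SPLIT_PMD_PTLOCKS version of
huge_pte_lockptr() currently reads:

	static inline spinlock_t *huge_pte_lockptr(struct hstate *h,
				struct mm_struct *mm, pte_t *pte)
	{
		if (huge_page_size(h) == PMD_SIZE)
			return pmd_lockptr(mm, (pmd_t *) pte);
		VM_BUG_ON(huge_page_size(h) == PAGE_SIZE);
		return &mm->page_table_lock;
	}

so on those architectures it already hands back pmd_lockptr() whenever
huge_page_size(h) happens to equal PMD_SIZE - and whether that spinlock
was ever initialized for their hugetlb page tables is exactly what I'd
like checked.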

> +
> +	if (!pmd_huge(*pmd))
> +		goto out;
> +
> +	page = pte_page(*(pte_t *)pmd) + ((address & ~PMD_MASK) >> PAGE_SHIFT);
> +
> +	if (flags & FOLL_GET)
> +		if (!get_page_unless_zero(page))
> +			page = NULL;

get_page() should be quite good enough, shouldn't it?  We are holding
the necessary lock, and have tested pmd_huge(*pmd), so it would be a
bug if page_count(page) were zero here.
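
That is, just something like:

	if (flags & FOLL_GET)
		get_page(page);	/* ptl held and pmd_huge(*pmd) rechecked */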

> +out:
> +	spin_unlock(ptl);
>  	return page;
>  }
>  
>  struct page * __weak
> -follow_huge_pud(struct mm_struct *mm, unsigned long address,
> -		pud_t *pud, int write)
> +follow_huge_pud(struct vm_area_struct *vma, unsigned long address,
> +		pud_t *pud, int flags)
>  {
> -	struct page *page;
> +	struct page *page = NULL;
> +	spinlock_t *ptl;
>  
> -	page = pte_page(*(pte_t *)pud);
> -	if (page)
> -		page += ((address & ~PUD_MASK) >> PAGE_SHIFT);
> +	if (flags & FOLL_GET)
> +		return NULL;
> +
> +	ptl = huge_pte_lock(hstate_vma(vma), vma->vm_mm, (pte_t *)pud);

Well, you do have vma declared here, but otherwise it's like what you
had in follow_huge_addr(): there is no point in taking and dropping
the lock if you're not getting the page while the lock is held.

So, which way to go on follow_huge_pud()?  I certainly think that we
should implement FOLL_GET on it, as we should for follow_huge_addr(),
simply for completeness, and so we don't need to come back here.

But whether we should do so in a patch which is Cc'ed to stable is not
so clear.  And leaving follow_huge_pud() and follow_huge_addr() out
of this patch may avoid those awkward where-is-the-lock questions
for now.  Convert follow_huge_pud() in a separate patch?

> +
> +	if (!pud_huge(*pud))
> +		goto out;
> +
> +	page = pte_page(*(pte_t *)pud) + ((address & ~PUD_MASK) >> PAGE_SHIFT);
> +out:
> +	spin_unlock(ptl);
>  	return page;
>  }
>  
> -- 
> 1.9.3
