On Fri, 1 Aug 2014, Naoya Horiguchi wrote:

> We have a race condition between move_pages() and freeing hugepages,
> where move_pages() calls follow_page(FOLL_GET) for hugepages internally
> and tries to get its refcount without preventing concurrent freeing.
> This race crashes the kernel, so this patch fixes it by moving FOLL_GET
> code for hugepages into follow_huge_pmd() with taking the page table lock.
>
> This patch passes the following test. And libhugetlbfs test shows no
> regression.
>
> $ cat movepages.c
> #include <stdio.h>
> #include <stdlib.h>
> #include <numaif.h>
>
> #define ADDR_INPUT 0x700000000000UL
> #define HPS 0x200000
> #define PS 0x1000
>
> int main(int argc, char *argv[]) {
>         int i;
>         int nr_hp = strtol(argv[1], NULL, 0);
>         int nr_p = nr_hp * HPS / PS;
>         int ret;
>         void **addrs;
>         int *status;
>         int *nodes;
>         pid_t pid;
>
>         pid = strtol(argv[2], NULL, 0);
>         addrs = malloc(sizeof(char *) * nr_p + 1);
>         status = malloc(sizeof(char *) * nr_p + 1);
>         nodes = malloc(sizeof(char *) * nr_p + 1);
>
>         while (1) {
>                 for (i = 0; i < nr_p; i++) {
>                         addrs[i] = (void *)ADDR_INPUT + i * PS;
>                         nodes[i] = 1;
>                         status[i] = 0;
>                 }
>                 ret = numa_move_pages(pid, nr_p, addrs, nodes, status,
>                                       MPOL_MF_MOVE_ALL);
>                 if (ret == -1)
>                         err("move_pages");
>
>                 for (i = 0; i < nr_p; i++) {
>                         addrs[i] = (void *)ADDR_INPUT + i * PS;
>                         nodes[i] = 0;
>                         status[i] = 0;
>                 }
>                 ret = numa_move_pages(pid, nr_p, addrs, nodes, status,
>                                       MPOL_MF_MOVE_ALL);
>                 if (ret == -1)
>                         err("move_pages");
>         }
>         return 0;
> }
>
> $ cat hugepage.c
> #include <stdio.h>
> #include <sys/mman.h>
> #include <string.h>
>
> #define ADDR_INPUT 0x700000000000UL
> #define HPS 0x200000
>
> int main(int argc, char *argv[]) {
>         int nr_hp = strtol(argv[1], NULL, 0);
>         char *p;
>
>         while (1) {
>                 p = mmap((void *)ADDR_INPUT, nr_hp * HPS, PROT_READ | PROT_WRITE,
>                          MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
>                 if (p != (void *)ADDR_INPUT) {
>                         perror("mmap");
>                         break;
>                 }
>                 memset(p, 0, nr_hp * HPS);
>                 munmap(p, nr_hp * HPS);
>         }
> }
>
> $ sysctl vm.nr_hugepages=40
> $ ./hugepage 10 &
> $ ./movepages 10 $(pgrep -f hugepage)
>
> Note for stable inclusion:
> This patch fixes e632a938d914 ("mm: migrate: add hugepage migration code
> to move_pages()"), so is applicable to -stable kernels which includes it.
>
> ChangeLog v2:
> - introduce follow_huge_pmd_lock() to do locking in arch-independent code.
>
> Signed-off-by: Naoya Horiguchi <n-horiguchi@xxxxxxxxxxxxx>
> Cc: <stable@xxxxxxxxxxxxxxx> # [3.12+]
> ---
>  include/linux/hugetlb.h |  3 +++
>  mm/gup.c                | 17 ++---------------
>  mm/hugetlb.c            | 27 +++++++++++++++++++++++++++
>  3 files changed, 32 insertions(+), 15 deletions(-)
>
> diff --git mmotm-2014-07-22-15-58.orig/include/linux/hugetlb.h mmotm-2014-07-22-15-58/include/linux/hugetlb.h
> index 41272bcf73f8..194834853d6f 100644
> --- mmotm-2014-07-22-15-58.orig/include/linux/hugetlb.h
> +++ mmotm-2014-07-22-15-58/include/linux/hugetlb.h
> @@ -101,6 +101,8 @@ struct page *follow_huge_pmd(struct mm_struct *mm, unsigned long address,
>                          pmd_t *pmd, int write);
>  struct page *follow_huge_pud(struct mm_struct *mm, unsigned long address,
>                          pud_t *pud, int write);
> +struct page *follow_huge_pmd_lock(struct vm_area_struct *vma,
> +                        unsigned long address, pmd_t *pmd, int flags);
>  int pmd_huge(pmd_t pmd);
>  int pud_huge(pud_t pmd);
>  unsigned long hugetlb_change_protection(struct vm_area_struct *vma,
> @@ -134,6 +136,7 @@ static inline void hugetlb_show_meminfo(void)
>  }
>  #define follow_huge_pmd(mm, addr, pmd, write) NULL
>  #define follow_huge_pud(mm, addr, pud, write) NULL
> +#define follow_huge_pmd_lock(vma, addr, pmd, flags) NULL
>  #define prepare_hugepage_range(file, addr, len) (-EINVAL)
>  #define pmd_huge(x) 0
>  #define pud_huge(x) 0
> diff --git mmotm-2014-07-22-15-58.orig/mm/gup.c mmotm-2014-07-22-15-58/mm/gup.c
> index 91d044b1600d..e4bd59efe686 100644
> --- mmotm-2014-07-22-15-58.orig/mm/gup.c
> +++ mmotm-2014-07-22-15-58/mm/gup.c
> @@ -174,21 +174,8 @@ struct page *follow_page_mask(struct vm_area_struct *vma,
>          pmd = pmd_offset(pud, address);
>          if (pmd_none(*pmd))
>                  return no_page_table(vma, flags);
> -        if (pmd_huge(*pmd) && vma->vm_flags & VM_HUGETLB) {
> -                page = follow_huge_pmd(mm, address, pmd, flags & FOLL_WRITE);
> -                if (flags & FOLL_GET) {
> -                        /*
> -                         * Refcount on tail pages are not well-defined and
> -                         * shouldn't be taken. The caller should handle a NULL
> -                         * return when trying to follow tail pages.
> -                         */
> -                        if (PageHead(page))
> -                                get_page(page);
> -                        else
> -                                page = NULL;
> -                }
> -                return page;
> -        }
> +        if (pmd_huge(*pmd) && vma->vm_flags & VM_HUGETLB)
> +                return follow_huge_pmd_lock(vma, address, pmd, flags);

Yes, that's good (except I don't like the _lock name).

>          if ((flags & FOLL_NUMA) && pmd_numa(*pmd))
>                  return no_page_table(vma, flags);
>          if (pmd_trans_huge(*pmd)) {
> diff --git mmotm-2014-07-22-15-58.orig/mm/hugetlb.c mmotm-2014-07-22-15-58/mm/hugetlb.c
> index 7263c770e9b3..4437896cd6ed 100644
> --- mmotm-2014-07-22-15-58.orig/mm/hugetlb.c
> +++ mmotm-2014-07-22-15-58/mm/hugetlb.c
> @@ -3687,6 +3687,33 @@ follow_huge_pud(struct mm_struct *mm, unsigned long address,
>
>  #endif /* CONFIG_ARCH_WANT_GENERAL_HUGETLB */
>
> +struct page *follow_huge_pmd_lock(struct vm_area_struct *vma,
> +                        unsigned long address, pmd_t *pmd, int flags)
> +{
> +        struct page *page;
> +        spinlock_t *ptl;
> +
> +        if (flags & FOLL_GET)
> +                ptl = huge_pte_lock(hstate_vma(vma), vma->vm_mm, (pte_t *)pmd);
> +

But this is not good enough, I'm afraid.

> +        page = follow_huge_pmd(vma->vm_mm, address, pmd, flags & FOLL_WRITE);
> +
> +        if (flags & FOLL_GET) {
> +                /*
> +                 * Refcount on tail pages are not well-defined and
> +                 * shouldn't be taken. The caller should handle a NULL
> +                 * return when trying to follow tail pages.
> +                 */
> +                if (PageHead(page))
> +                        get_page(page);
> +                else
> +                        page = NULL;
> +                spin_unlock(ptl);
> +        }
> +
> +        return page;
> +}
> +
>  #ifdef CONFIG_MEMORY_FAILURE
>
>  /* Should be called in hugetlb_lock */
> --
> 1.9.3

Thanks a lot for remembering this, but it's not enough, I think.
It is an improvement over the current code (except for the annoying new
level, and its confusing name follow_huge_pmd_lock); but I don't want to
keep on coming back, repeatedly sending new corrections to four or more
releases of -stable. Please let's get it right and be done with it.

I see two problems with the above, but perhaps I'm mistaken.

One is hugetlb_vmtruncate(): follow_huge_pmd_lock() is only called when
we have observed pmd_huge(*pmd), fine, but how can we assume that
pmd_huge(*pmd) still holds after getting the necessary huge_pte_lock?
Truncation could have changed that *pmd to none, and then pte_page()
will supply an incorrect (but non-NULL) address. (I observe the
follow_huge_pmd()s all doing an "if (page)" after their pte_page(), but
when I checked at the time of the original follow_huge_addr() problem,
I could not find any architecture with a pte_page() returning NULL for
an invalid entry - pte_page() is a simple blind translation in every
architecture, I believe, but please check.)

Two is x86-32 PAE (and perhaps some other architectures), in which the
pmd entry spans two machine words, loaded independently. It's a very
narrow race window, but we cannot safely access the whole *pmd without
locking: we might pick up two mismatched halves. Take a look at
pte_unmap_same() in mm/memory.c, it's handling that issue on ptes.

So, if I follow my distaste for the intermediate follow_huge_pmd_lock
level (and in patch 4/3 you are already changing all the declarations,
so no need to be deterred by that task), I think what we need is:

struct page *
follow_huge_pmd(struct vm_area_struct *vma, unsigned long address,
                pmd_t *pmd, unsigned int flags)
{
        struct page *page;
        spinlock_t *ptl;

        ptl = huge_pte_lock(hstate_vma(vma), vma->vm_mm, (pte_t *)pmd);

        if (!pmd_huge(*pmd)) {
                page = NULL;
                goto out;
        }

        page = pte_page(*(pte_t *)pmd) + ((address & ~PMD_MASK) >> PAGE_SHIFT);

        if (flags & FOLL_GET) {
                /*
                 * Refcount on tail pages are not well-defined and
                 * shouldn't be taken. The caller should handle a NULL
                 * return when trying to follow tail pages.
                 */
                if (PageHead(page))
                        get_page(page);
                else
                        page = NULL;
        }
out:
        spin_unlock(ptl);
        return page;
}

Yes, there are many !FOLL_GET cases which could use an ACCESS_ONCE(*pmd)
and avoid taking the lock; but follow_page_pte() is content to take its
lock in all cases, so I don't see any need to avoid it in this much less
common case.

And it looks to me as if this follow_huge_pmd() would be good for every
hugetlb architecture (but I may especially be wrong on that, having
compiled it for none but x86_64). On some architectures, the ones which
currently present just a stub follow_huge_pmd(), the optimizer should
eliminate everything after the !pmd_huge test, and we won't be calling
it on those anyway. On mips, s390 and tile, I think the above represents
what they're currently doing, despite some saying HPAGE_MASK in place of
PMD_MASK, and having that funny "if (page)" condition after pte_page().

Please check carefully: I think the above follow_huge_pmd() can sit in
mm/hugetlb.c, for use on all architectures; and the variants be deleted;
and I think that would be an improvement.
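Just to illustrate the call site (an untested sketch, and it assumes the
declarations get updated along the lines of your 4/3): the hugetlb case
in follow_page_mask() would then collapse to

        if (pmd_huge(*pmd) && vma->vm_flags & VM_HUGETLB)
                return follow_huge_pmd(vma, address, pmd, flags);

with no FOLL_GET handling left in mm/gup.c at all.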
I'm not sure what should happen to follow_huge_pud() if we go this way.
There's a good argument for adapting it in exactly the same way, but
that may not appeal to those wanting to remove the never-used argument.

And, please, let's go just a little further, while we are having to
think of these issues. Isn't what we're doing here much the same as we
need to do to follow_huge_addr(), to fix the May 28th issues which led
you to disable hugetlb migration on all but x86_64? I'm not arguing to
re-enable hugetlb migration on those architectures which you cannot
test, no, you did the right thing to leave that to them. But could we
please update follow_huge_addr() (in a separate patch) to make it
consistent with this follow_huge_pmd(), so that at least you can tell
maintainers that you believe it is working now?

Uh oh. I thought I had finished writing about this patch, but just
realized more. Above you can see that I've faithfully copied your
"Refcount on tail pages are not well-defined" comment and !PageHead
NULL. But that's nonsense, isn't it? Refcount on tail pages is and must
be well-defined, and that's been handled in follow_hugetlb_page() for,
well, at least ten years. But note the "Some archs" comment in
follow_hugetlb_page(): I have not followed it up, and it may prove to
be irrelevant here; but it suggests that in general some additional
care might be needed for the get_page()s - or perhaps they should now
be get_page_folls()?

I guess the "not well-defined" comment was your guess as to why I had
put in the BUG_ON(flags & FOLL_GET)s: no, they were because nobody
required huge FOLL_GET at that time, and that case lacked the locking
which you are now supplying.

Hugh