Hi Hugh,

Sorry for the delayed response, I was off on vacation.

On Mon, Sep 29, 2014 at 09:32:20PM -0700, Hugh Dickins wrote:
> On Mon, 15 Sep 2014, Naoya Horiguchi wrote:
> > We have a race condition between move_pages() and freeing hugepages,
> 
> I've been looking through these 5 today, and they're much better now,
> thank you. But a new concern below, and a minor correction to 3/5.
> 
> > --- mmotm-2014-09-09-14-42.orig/mm/gup.c
> > +++ mmotm-2014-09-09-14-42/mm/gup.c
> > @@ -162,33 +162,16 @@ struct page *follow_page_mask(struct vm_area_struct *vma,
> > 
> >          pmd = pmd_offset(pud, address);
> >          if (pmd_none(*pmd))
> >                  return no_page_table(vma, flags);
> > -        if (pmd_huge(*pmd) && vma->vm_flags & VM_HUGETLB) {
> > -                page = follow_huge_pmd(mm, address, pmd, flags & FOLL_WRITE);
> > -                if (flags & FOLL_GET) {
> > -                        /*
> > -                         * Refcount on tail pages are not well-defined and
> > -                         * shouldn't be taken. The caller should handle a NULL
> > -                         * return when trying to follow tail pages.
> > -                         */
> > -                        if (PageHead(page))
> > -                                get_page(page);
> > -                        else
> > -                                page = NULL;
> > -                }
> > -                return page;
> > -        }
> > +        if (pmd_huge(*pmd) && vma->vm_flags & VM_HUGETLB)
> > +                return follow_huge_pmd(mm, address, pmd, flags);
> 
> This code here allows for pmd_none() and pmd_huge(), and for pmd_numa()
> and pmd_trans_huge() below; but makes no explicit allowance for !present
> migration and hwpoison entries.
> 
> Is it assumed that the pmd_bad() test in follow_page_pte() will catch
> those?

Yes, it is now.

> But what of races?

The current patch is still racy when hugepage migrations triggered for
different reasons (hotremove and mbind, for example) happen concurrently.
We need a fix.

> migration entries are highly volatile. And
> is it assumed that a migration entry cannot pass the pmd_huge() test?

Yes, the _PAGE_PSE bit is always clear for migration/hwpoison entries, so
they can never pass the pmd_huge() test for now.

> That may be true of x86 today, I'm not certain; but if the soft-dirty
> people catch up with the hugetlb-migration people, that might change
> (they #define _PAGE_SWP_SOFT_DIRTY _PAGE_PSE).

Yes, this problem is not visible now (note that currently
_PAGE_SWP_SOFT_DIRTY is never set on a pmd because hugepages are never
swapped out), but it is a potential one.

> Why pmd_huge() does not itself test for present, I cannot say; but it
> probably didn't matter at all before hwpoison and migration were added.

Correct, so we need to check the _PAGE_PRESENT bit in the x86_64
pmd_huge() now. And we need to take some proper action if we find a
migration/hwpoison entry here. To do this, adding another routine like
huge_pmd_present() might be useful (pmd_present() is already used for
thp).

> Mind you, with __get_user_pages()'s is_vm_hugetlb_page() test avoiding
> all this code, maybe the only thing that can stumble here is your own
> hugetlb migration code; but that appears to be guarded only by
> down_read of mmap_sem, so races would be possible (if userspace
> is silly enough or malicious enough to do so).

I guess that race is fixed by this patch together with checking
_PAGE_PRESENT in pmd_huge(). Or are you referring to another race?

> What we have here today looks too fragile to me, but it's probably
> best dealt with by a separate patch.
> 
> Or I may be over-anxious, and there may be something "obvious"
> that I'm missing, which saves us from further change.

No, you found a new issue in the current code, thank you very much.
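To make the pmd_huge() idea above concrete, here is a minimal sketch
(an illustration only, not the actual patch; it would go in
arch/x86/mm/hugetlbpage.c):

/*
 * Sketch: a migration/hwpoison entry has _PAGE_PRESENT clear, so
 * requiring both bits makes pmd_huge() mean "a normal, present hugepage"
 * and non-present entries no longer pass the test.
 */
int pmd_huge(pmd_t pmd)
{
        return (pmd_val(pmd) & (_PAGE_PRESENT | _PAGE_PSE)) ==
               (_PAGE_PRESENT | _PAGE_PSE);
}

Whether we also want a separate huge_pmd_present() helper on top of this
is still open.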
> >          if ((flags & FOLL_NUMA) && pmd_numa(*pmd))
> >                  return no_page_table(vma, flags);
> >          if (pmd_trans_huge(*pmd)) {
> > diff --git mmotm-2014-09-09-14-42.orig/mm/hugetlb.c mmotm-2014-09-09-14-42/mm/hugetlb.c
> > index 34351251e164..941832ee3d5a 100644
> > --- mmotm-2014-09-09-14-42.orig/mm/hugetlb.c
> > +++ mmotm-2014-09-09-14-42/mm/hugetlb.c
> > @@ -3668,26 +3668,34 @@ follow_huge_addr(struct mm_struct *mm, unsigned long address,
> > 
> >  struct page * __weak
> >  follow_huge_pmd(struct mm_struct *mm, unsigned long address,
> > -                pmd_t *pmd, int write)
> > +                pmd_t *pmd, int flags)
> >  {
> > -        struct page *page;
> > +        struct page *page = NULL;
> > +        spinlock_t *ptl;
> > 
> > -        page = pte_page(*(pte_t *)pmd);
> > -        if (page)
> > -                page += ((address & ~PMD_MASK) >> PAGE_SHIFT);
> > +        ptl = pmd_lockptr(mm, pmd);
> > +        spin_lock(ptl);
> > +
> > +        if (!pmd_huge(*pmd))
> > +                goto out;
> 
> And similarly here. Though at least here we now have the necessary
> lock, so it's no longer racy, and maybe this pmd_huge() test just needs
> to be replaced by a pmd_present() test? Or are both needed?

This check was introduced because the first pmd_huge() check, outside
follow_huge_pmd(), is done without the page table lock, so keeping it as
a recheck after taking the lock looks correct to me.
But as I mentioned above, I'm thinking of changing x86_64's pmd_huge()
to check both _PAGE_PRESENT and _PAGE_PSE to make sure that *pmd points
to a normal hugepage, so this check would then internally cover both.
(A rough sketch of how a migration entry found under the lock could be
handled follows the quoted hunk at the end of this mail.)

Thanks,
Naoya Horiguchi

> > +
> > +        page = pte_page(*(pte_t *)pmd) + ((address & ~PMD_MASK) >> PAGE_SHIFT);
> > +
> > +        if (flags & FOLL_GET)
> > +                get_page(page);
> > +out:
> > +        spin_unlock(ptl);
> >          return page;
> >  }
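For illustration only (a sketch under the assumptions above, not the
patch as posted): with a present-aware pmd_huge(), the locked section
could wait on a migration entry and retry instead of just returning
NULL. It relies on is_hugetlb_entry_migration(), a static helper already
in mm/hugetlb.c, and on __migration_entry_wait() from mm/migrate.c:

struct page * __weak
follow_huge_pmd(struct mm_struct *mm, unsigned long address,
                pmd_t *pmd, int flags)
{
        struct page *page = NULL;
        spinlock_t *ptl;

retry:
        ptl = pmd_lockptr(mm, pmd);
        spin_lock(ptl);
        if (pmd_huge(*pmd)) {
                /* present-aware pmd_huge(): a normal hugepage is mapped */
                page = pte_page(*(pte_t *)pmd) +
                        ((address & ~PMD_MASK) >> PAGE_SHIFT);
                if (flags & FOLL_GET)
                        get_page(page);
        } else if (!pmd_none(*pmd) &&
                   is_hugetlb_entry_migration(huge_ptep_get((pte_t *)pmd))) {
                /* migration entry: drop the lock, wait, then look again */
                spin_unlock(ptl);
                __migration_entry_wait(mm, (pte_t *)pmd, ptl);
                goto retry;
        }
        /* anything else (cleared entry, hwpoison) returns NULL */
        spin_unlock(ptl);
        return page;
}

__migration_entry_wait() takes ptl itself, which is why the lock is
dropped before calling it.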