On Sat, Aug 09, 2014 at 04:01:39PM -0700, Hugh Dickins wrote:
> On Fri, 1 Aug 2014, Naoya Horiguchi wrote:
...
> > diff --git mmotm-2014-07-22-15-58.orig/mm/gup.c mmotm-2014-07-22-15-58/mm/gup.c
> > index 91d044b1600d..e4bd59efe686 100644
> > --- mmotm-2014-07-22-15-58.orig/mm/gup.c
> > +++ mmotm-2014-07-22-15-58/mm/gup.c
> > @@ -174,21 +174,8 @@ struct page *follow_page_mask(struct vm_area_struct *vma,
> >  	pmd = pmd_offset(pud, address);
> >  	if (pmd_none(*pmd))
> >  		return no_page_table(vma, flags);
> > -	if (pmd_huge(*pmd) && vma->vm_flags & VM_HUGETLB) {
> > -		page = follow_huge_pmd(mm, address, pmd, flags & FOLL_WRITE);
> > -		if (flags & FOLL_GET) {
> > -			/*
> > -			 * Refcount on tail pages are not well-defined and
> > -			 * shouldn't be taken. The caller should handle a NULL
> > -			 * return when trying to follow tail pages.
> > -			 */
> > -			if (PageHead(page))
> > -				get_page(page);
> > -			else
> > -				page = NULL;
> > -		}
> > -		return page;
> > -	}
> > +	if (pmd_huge(*pmd) && vma->vm_flags & VM_HUGETLB)
> > +		return follow_huge_pmd_lock(vma, address, pmd, flags);
>
> Yes, that's good (except I don't like the _lock name).
>
> >  	if ((flags & FOLL_NUMA) && pmd_numa(*pmd))
> >  		return no_page_table(vma, flags);
> >  	if (pmd_trans_huge(*pmd)) {
> > diff --git mmotm-2014-07-22-15-58.orig/mm/hugetlb.c mmotm-2014-07-22-15-58/mm/hugetlb.c
> > index 7263c770e9b3..4437896cd6ed 100644
> > --- mmotm-2014-07-22-15-58.orig/mm/hugetlb.c
> > +++ mmotm-2014-07-22-15-58/mm/hugetlb.c
> > @@ -3687,6 +3687,33 @@ follow_huge_pud(struct mm_struct *mm, unsigned long address,
> >
> >  #endif /* CONFIG_ARCH_WANT_GENERAL_HUGETLB */
> >
> > +struct page *follow_huge_pmd_lock(struct vm_area_struct *vma,
> > +				unsigned long address, pmd_t *pmd, int flags)
> > +{
> > +	struct page *page;
> > +	spinlock_t *ptl;
> > +
> > +	if (flags & FOLL_GET)
> > +		ptl = huge_pte_lock(hstate_vma(vma), vma->vm_mm, (pte_t *)pmd);
> > +
>
> But this is not good enough, I'm afraid.
>
> > +	page = follow_huge_pmd(vma->vm_mm, address, pmd, flags & FOLL_WRITE);
> > +
> > +	if (flags & FOLL_GET) {
> > +		/*
> > +		 * Refcount on tail pages are not well-defined and
> > +		 * shouldn't be taken. The caller should handle a NULL
> > +		 * return when trying to follow tail pages.
> > +		 */
> > +		if (PageHead(page))
> > +			get_page(page);
> > +		else
> > +			page = NULL;
> > +		spin_unlock(ptl);
> > +	}
> > +
> > +	return page;
> > +}
> > +
> >  #ifdef CONFIG_MEMORY_FAILURE
> >
> >  /* Should be called in hugetlb_lock */
> > --
> > 1.9.3
>
> Thanks a lot for remembering this, but it's not enough, I think.

Thank you very much for the detailed comments.

> It is an improvement over the current code (except for the annoying new
> level, and its confusing name follow_huge_pmd_lock); but I don't want to
> keep on coming back, repeatedly sending new corrections to four or more
> releases of -stable.  Please let's get it right and be done with it.
>
> I see two problems with the above, but perhaps I'm mistaken.
>
> One is hugetlb_vmtruncate(): follow_huge_pmd_lock() is only called
> when we have observed pmd_huge(*pmd), fine, but how can we assume
> that pmd_huge(*pmd) still after getting the necessary huge_pte_lock?
> Truncation could have changed that *pmd to none, and then pte_page()
> will supply an incorrect (but non-NULL) address.

OK, this race window is still open, so double-checking *pmd after taking
huge_pte_lock() is necessary.

> (I observe the follow_huge_pmd()s all doing an "if (page)" after
> their pte_page(), but when I checked at the time of the original
> follow_huge_addr() problem, I could not find any architecture with
> a pte_page() returning NULL for an invalid entry - pte_page() is
> a simple blind translation in every architecture, I believe, but
> please check.)

I agree: no implementation of pte_page() has an invalid-entry check.
They all essentially compute (base address + pfn taken from pte_val()),
so they never return NULL for any input.
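For example, the x86 version is nothing more than a pfn-to-struct-page
translation, with no validity check anywhere on the path (quoting from
memory, so please double-check the header):

	/* arch/x86/include/asm/pgtable.h */
	#define pte_page(pte)	pfn_to_page(pte_pfn(pte))

So a cleared or stale entry just yields a wrong, but non-NULL,
struct page pointer, as you say.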
> Two is x86-32 PAE (and perhaps some other architectures), in which
> the pmd entry spans two machine words, loaded independently.  It's a
> very narrow race window, but we cannot safely access the whole *pmd
> without locking: we might pick up two mismatched halves.  Take a look
> at pte_unmap_same() in mm/memory.c, it's handling that issue on ptes.

OK, so ...

> So, if I follow my distaste for the intermediate follow_huge_pmd_lock
> level (and in patch 4/3 you are already changing all the declarations,
> so no need to be deterred by that task), I think what we need is:
>
> struct page *
> follow_huge_pmd(struct vm_area_struct *vma, unsigned long address,
> 		pmd_t *pmd, unsigned int flags)
> {
> 	struct page *page;
> 	spinlock_t *ptl;
>
> 	ptl = huge_pte_lock(hstate_vma(vma), vma->vm_mm, (pte_t *)pmd);
>
> 	if (!pmd_huge(*pmd)) {

.. I guess that checking pte_same() is better than pmd_huge(), because
it also covers your 2nd concern above?

> 		page = NULL;
> 		goto out;
> 	}
>
> 	page = pte_page(*(pte_t *)pmd) + ((address & ~PMD_MASK) >> PAGE_SHIFT);
>
> 	if (flags & FOLL_GET) {
> 		/*
> 		 * Refcount on tail pages are not well-defined and
> 		 * shouldn't be taken. The caller should handle a NULL
> 		 * return when trying to follow tail pages.
> 		 */
> 		if (PageHead(page))
> 			get_page(page);
> 		else
> 			page = NULL;
> 	}
> out:
> 	spin_unlock(ptl);
> 	return page;
> }
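To make my pte_same() suggestion concrete: what I have in mind is the
pte_unmap_same() pattern, i.e. snapshot the entry before taking the lock
and verify it under the lock, so that a truncation in the window and a
torn read of the two halves on 32-bit PAE both make the check fail.
A completely untested sketch:

struct page *
follow_huge_pmd(struct vm_area_struct *vma, unsigned long address,
		pmd_t *pmd, unsigned int flags)
{
	struct page *page = NULL;
	spinlock_t *ptl;
	pte_t entry = *(pte_t *)pmd;	/* unlocked, possibly torn read */

	ptl = huge_pte_lock(hstate_vma(vma), vma->vm_mm, (pte_t *)pmd);
	/*
	 * *pmd is stable under the lock: if it no longer matches the
	 * snapshot, or is no longer a huge entry, give up and let the
	 * caller handle the NULL return.
	 */
	if (!pte_same(*(pte_t *)pmd, entry) || !pmd_huge(*pmd))
		goto out;

	page = pte_page(entry) + ((address & ~PMD_MASK) >> PAGE_SHIFT);
	if (flags & FOLL_GET)
		get_page(page);
out:
	spin_unlock(ptl);
	return page;
}

(The FOLL_GET part is discussed further below.)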
> Yes, there are many !FOLL_GET cases which could use an ACCESS_ONCE(*pmd)
> and avoid taking the lock; but follow_page_pte() is content to take its
> lock in all cases, so I don't see any need to avoid it in this much less
> common case.

OK.

> And it looks to me as if this follow_huge_pmd() would be good for every
> hugetlb architecture (but I may especially be wrong on that, having
> compiled it for none but x86_64).  On some architectures, the ones which
> currently present just a stub follow_huge_pmd(), the optimizer should
> eliminate everything after the !pmd_huge test, and we won't be calling
> it on those anyway.  On mips s390 and tile, I think the above represents
> what they're currently doing, despite some saying HPAGE_MASK in place of
> PMD_MASK, and having that funny "if (page)" condition after pte_page().

Yes, I think that we can replace all arch-dependent follow_huge_pmd()s
with the above one. Then we need to check hugepage_migration_supported()
on the move_pages() side.

> Please check carefully: I think the above follow_huge_pmd() can sit in
> mm/hugetlb.c, for use on all architectures; and the variants be deleted;
> and I think that would be an improvement.

Yes, I'll try this.

> I'm not sure what should happen to follow_huge_pud() if we go this way.
> There's a good argument for adapting it in exactly the same way, but
> that may not appeal to those wanting to remove the never used argument.
>
> And, please, let's go just a little further, while we are having to
> think of these issues.  Isn't what we're doing here much the same as
> we need to do to follow_huge_addr(), to fix the May 28th issues which
> led you to disable hugetlb migration on all but x86_64?

Currently follow_huge_addr() ignores FOLL_GET, so just fixing the locking
problem as this patch does is not enough for follow_huge_addr(). But when
we re-enable hugepage migration, for example on powerpc, we will need to
consider exactly the same problem.

> I'm not arguing to re-enable hugetlb migration on those architectures
> which you cannot test, no, you did the right thing to leave that to
> them.  But could we please update follow_huge_addr() (in a separate
> patch) to make it consistent with this follow_huge_pmd(), so that at
> least you can tell maintainers that you believe it is working now?

OK, I'll do it. What we can do is take the page table lock around
pte_page(), and remove the unnecessary "if (page)" check, I think.

> Uh oh.  I thought I had finished writing about this patch, but just
> realized more.  Above you can see that I've faithfully copied your
> "Refcount on tail pages are not well-defined" comment and !PageHead
> NULL.  But that's nonsense, isn't it?  Refcount on tail pages is and
> must be well-defined, and that's been handled in follow_hugetlb_page()
> for, well, at least ten years.

Ah, this "not well-defined" comment was completely wrong. I'll remove it.

> But note the "Some archs" comment in follow_hugetlb_page(): I have
> not followed it up, and it may prove to be irrelevant here; but it
> suggests that in general some additional care might be needed for
> the get_page()s - or perhaps they should now be get_page_folls()?

Yes, it's get_page_foll() now, and it internally pins tail pages by
incrementing page->_mapcount, so using get_page_unless_zero() in
follow_huge_pmd_lock() might be better (to skip tails naturally).

Thanks,
Naoya Horiguchi

> I guess the "not well-defined" comment was your guess as to why I had
> put in the BUG_ON(flags & FOLL_GET)s: no, they were because nobody
> required huge FOLL_GET at that time, and that case lacked the locking
> which you are now supplying.
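P.S. To make the get_page_unless_zero() idea above concrete, the FOLL_GET
part could become something like this (untested sketch):

	if (flags & FOLL_GET) {
		/*
		 * Tail pages of a hugetlb page have _count == 0 (the
		 * reference is held on the head page), so this fails on
		 * tails and the caller sees NULL, while a head page gets
		 * a real reference.
		 */
		if (!get_page_unless_zero(page))
			page = NULL;
	}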