Hi Hugh,

On Tue, Oct 15, 2013 at 04:08:28AM -0700, Hugh Dickins wrote:
> Occasionally we hit the BUG_ON(pmd_trans_huge(*pmd)) at the end of
> __split_huge_page_pmd(): seen when doing madvise(,,MADV_DONTNEED).
>
> It's invalid: we don't always have down_write of mmap_sem there:
> a racing do_huge_pmd_wp_page() might have copied-on-write to another
> huge page before our split_huge_page() got the anon_vma lock.

I don't get exactly the scenario with do_huge_pmd_wp_page(), could you
elaborate?

My scenario is that at the point marked below, another
madvise(MADV_DONTNEED) runs:

	spin_unlock(&mm->page_table_lock);
	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);

	------ madvise(MADV_DONTNEED) from another thread zapping the
	       entire 2m and clearing the pmd

	split_huge_page(page);

Then split_huge_page does nothing because the page has been freed, and
after split_huge_page returns, a new page fault fills in an anonymous
trans huge page right before the BUG_ON. And the crashing thread would
always be a partial MADV_DONTNEED with a misaligned end address (so
requiring a split_huge_page to zap 4k subpages).

So the testcase required would be: two concurrent MADV_DONTNEED, where
the first has a misaligned "end" address (the one that triggers the
BUG_ON), the second MADV_DONTNEED has an end address that covers the
whole hugepmd, and a trans huge page fault happening just before the
false positive triggers. Maybe your scenario with do_huge_pmd_wp_page()
is simpler?

> Forget the BUG_ON, just go back and try again if this happens.
>
> Signed-off-by: Hugh Dickins <hughd@xxxxxxxxxx>
> Cc: stable@xxxxxxxxxxxxxxx
> ---
>
>  mm/huge_memory.c | 10 +++++++++-
>  1 file changed, 9 insertions(+), 1 deletion(-)
>
> --- 3.12-rc5/mm/huge_memory.c	2013-09-16 17:37:56.811072270 -0700
> +++ linux/mm/huge_memory.c	2013-10-15 03:40:02.044138488 -0700
> @@ -2697,6 +2697,7 @@ void __split_huge_page_pmd(struct vm_are
>
>  	mmun_start = haddr;
>  	mmun_end   = haddr + HPAGE_PMD_SIZE;
> +again:
>  	mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end);
>  	spin_lock(&mm->page_table_lock);
>  	if (unlikely(!pmd_trans_huge(*pmd))) {
> @@ -2719,7 +2720,14 @@
>  	split_huge_page(page);
>
>  	put_page(page);
> -	BUG_ON(pmd_trans_huge(*pmd));
> +
> +	/*
> +	 * We don't always have down_write of mmap_sem here: a racing
> +	 * do_huge_pmd_wp_page() might have copied-on-write to another
> +	 * huge page before our split_huge_page() got the anon_vma lock.
> +	 */
> +	if (unlikely(pmd_trans_huge(*pmd)))
> +		goto again;
>  }
>
>  void split_huge_page_pmd_mm(struct mm_struct *mm, unsigned long address,

While it is correct, the looping is misleading. We should simply
document that any caller of split_huge_page_pmd with the mmap_sem held
for reading should use pmd_none_or_trans_huge_or_clear_bad or
pmd_trans_unstable before making any assumption about the pmd being
"stable" after split_huge_page returns. This is of course what
zap_pmd_range already does, so it's perfectly safe if
split_huge_page_pmd returns with pmd_trans_huge(*pmd) == true. Even
with the loop, if a concurrent page fault that maps a trans_huge_pmd
(replacing the pmd_none) triggers just after the "goto again" check,
but before the "ret" instruction, the function still returns with a
trans huge page mapped in *pmd.
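To illustrate, this is more or less the relevant part of zap_pmd_range
(quoting from memory from 3.12 mm/memory.c, with the DEBUG_VM check
trimmed, so take the details with a grain of salt):

	pmd = pmd_offset(pud, addr);
	do {
		next = pmd_addr_end(addr, end);
		if (pmd_trans_huge(*pmd)) {
			if (next - addr != HPAGE_PMD_SIZE)
				split_huge_page_pmd(vma, addr, pmd);
			else if (zap_huge_pmd(tlb, vma, pmd, addr))
				goto next;
			/* fall through */
		}
		/*
		 * Here the pmd can be none or trans huge again:
		 * another MADV_DONTNEED or a trans huge page fault
		 * may have run, because we only hold the mmap_sem
		 * for reading. This check is what keeps the walk
		 * safe even if split_huge_page_pmd returned with a
		 * trans huge pmd.
		 */
		if (pmd_none_or_trans_huge_or_clear_bad(pmd))
			goto next;
		next = zap_pte_range(tlb, vma, pmd, addr, next, details);
next:
		cond_resched();
	} while (pmd++, addr != end);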
In short I think either we try the more strict but more tricky approach
from Hillf (https://patchwork.kernel.org/patch/2178311/), and we also
take into account the mapcount > 0 in the __split_huge_page loop to
make Hillf's patch safe (I think currently it isn't), or we just nuke
the BUG_ON completely. And we should document in the source what
happens with the mmap_sem held for reading in MADV_DONTNEED. Yet
another approach would be to add something like a down_write_trylock in
front of the check, after converting it to a VM_BUG_ON.

The patch in patchwork, despite trying to be more strict, doesn't look
safe to me, because I tend to think we should also change
__split_huge_page to return the "mapcount", so that split_huge_page
also fails if the mapcount was 0 in __split_huge_page. I believe
split_huge_page_to_list may obtain the anon_vma lock (so the page was
mapped), but then zap_huge_pmd may obtain the page_table_lock before
__split_huge_page runs __split_huge_page_splitting (which is what
freezes zap_huge_pmd, or any other MADV_DONTNEED truncating the pmd
while waiting in another split_huge_page). So I believe it is possible
(and safe) for __split_huge_page to run with mapcount 0 (doing
nothing). But it doesn't return failure in that case, and if the
mapcount is 0 the pmd may not have been converted to a stable state.
This is why that change would be needed in order to use the patchwork
patch.

The only place that depends on the split_huge_page retval so far is
add_to_swap. But regular anon pages can be zapped there too, so it
should be able to cope, and the swapcache will free itself once all
refcounts are dropped. So I think we don't need to worry about the
above in add_to_swap.

Could you also post the stack trace, so we can compare it to the one in
patchwork? Ideally this happened for you also in the context of a
MADV_DONTNEED; the other places walking pagetables with only the
mmap_sem for reading usually don't mangle pagetables.

Also note that the workload I described, with two MADV_DONTNEED running
concurrently plus a page fault on the very same transhugepage, looks
like a non-deterministic workload, but we still need to avoid false
positives in those non-deterministic cases. The two stack traces I have
for this problem (the patchwork one above and the bugzilla one below)
both confirm it's happening only inside MADV_DONTNEED, confirming my
theory above.

https://bugzilla.redhat.com/show_bug.cgi?id=949735
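In case it helps with reproducing, here is an untested userspace sketch
of the testcase I described above (all names and sizes are illustrative
only; it assumes a THP-capable kernel and gcc -pthread):

	#include <pthread.h>
	#include <string.h>
	#include <sys/mman.h>
	#include <unistd.h>

	#define HPAGE_SIZE	(2UL * 1024 * 1024)

	static char *region;

	/* thread 1: partial MADV_DONTNEED with a misaligned "end",
	 * forcing split_huge_page_pmd to zap 4k subpages */
	static void *partial_dontneed(void *arg)
	{
		for (;;)
			madvise(region, HPAGE_SIZE - 4096, MADV_DONTNEED);
		return NULL;
	}

	/* thread 2: MADV_DONTNEED covering the whole hugepmd */
	static void *full_dontneed(void *arg)
	{
		for (;;)
			madvise(region, HPAGE_SIZE, MADV_DONTNEED);
		return NULL;
	}

	/* thread 3: keeps re-faulting trans huge pages into the region */
	static void *fault_in(void *arg)
	{
		for (;;)
			memset(region, 1, HPAGE_SIZE);
		return NULL;
	}

	int main(void)
	{
		pthread_t t[3];
		unsigned long addr;
		void *map;

		/* over-allocate so we can round up to a 2m boundary
		 * and the region can be backed by a single hugepmd */
		map = mmap(NULL, 2 * HPAGE_SIZE, PROT_READ | PROT_WRITE,
			   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
		if (map == MAP_FAILED)
			return 1;
		addr = ((unsigned long)map + HPAGE_SIZE - 1) &
			~(HPAGE_SIZE - 1);
		region = (char *)addr;
		madvise(region, HPAGE_SIZE, MADV_HUGEPAGE);

		pthread_create(&t[0], NULL, partial_dontneed, NULL);
		pthread_create(&t[1], NULL, full_dontneed, NULL);
		pthread_create(&t[2], NULL, fault_in, NULL);
		pause();
		return 0;
	}

Being the race window so small, it may need to run for a long while
before the false positive triggers, if it triggers at all.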