On Tue, 15 Oct 2013, Kirill A. Shutemov wrote: > Andrea Arcangeli wrote: > > Hi Hugh, > > > > On Tue, Oct 15, 2013 at 04:08:28AM -0700, Hugh Dickins wrote: > > > Occasionally we hit the BUG_ON(pmd_trans_huge(*pmd)) at the end of > > > __split_huge_page_pmd(): seen when doing madvise(,,MADV_DONTNEED). > > > > > > It's invalid: we don't always have down_write of mmap_sem there: > > > a racing do_huge_pmd_wp_page() might have copied-on-write to another > > > huge page before our split_huge_page() got the anon_vma lock. > > > > > > > I don't get exactly the scenario with do_huge_pmd_wp_page(), could you > > elaborate? > > I think the scenario is follow: > > CPU0: CPU1 > > __split_huge_page_pmd() > page = pmd_page(*pmd); > do_huge_pmd_wp_page() copy the > page and changes pmd (the same as on CPU0) > to point to newly copied page. > split_huge_page(page) > where page is original page, > not allocated on COW. > pmd still points on huge page. > > > Hugh, have I got it correctly? Yes, that's correct, that's what I've been assuming the race is. With CPU0 split_huge_page_pmd() being called from zap_pmd_range() in the service of madvise(,,MADV_DONTNEED). I don't have the stacktrace to hand: could perfectly well dig it out in an hour or two, but honestly, it adds nothing more to the picture. I have no trace of the CPU1 side of things, and have merely surmised that it was doing a COW. As to whether the MADV_DONTNEED down_read optimization is important: I don't recall, might be able to discover justification in old mail, 0a27a14a6292 doesn't actually say; but in general we're much better off using down_read than down_write where it's safe to do so. But more importantly, MADV_DONTNEED down_read across zap_page_range is building on the fact that file invalidation/truncation can already call zap_page_range without touching mmap_sem at all: not a problem for traditional anon-only THP, but something you'll have had to worry about for THPageCache. I'm afraid Andrea's mail about concurrent madvises gives me far more to think about than I have time for: seems to get into problems he knows a lot about but I'm unfamiliar with. If this patch looks good for now on its own, let's put it in; but no problem if you guys prefer to wait for a fuller solution of more problems, we can ride with this one internally for the moment. And I should admit that the crash has occurred too rarely for us yet to be able to judge whether this patch actually fixes it in practice. Hugh -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@xxxxxxxxx. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@xxxxxxxxx"> email@xxxxxxxxx </a>