On Tue, Feb 23, 2016 at 06:49:50PM +0300, Kirill A. Shutemov wrote: > Hi Andrea, > > I suspect there's race with THP in __handle_mm_fault(). It's pure > theoretical and race window is small, but.. > > Consider following scenario: > > - THP got allocated by other thread just before "pmd_none() && > __pte_alloc()" check, so pmd_none() is false and we don't > allocate the page table. > > - But before pmd_trans_huge() check the page got unmap by > MADV_DONTNEED in other thread. > > - At this point we will call pte_offset_map() for pmd which is > pmd_none(). > > Nothing pleasant would happen after this... > > Do you see anything what would prevent this scenario? No so I think we need s/pmd_trans_huge/pmd_trans_unstable/ and use the atomic read in C to sort this out lockless. The MADV_DONTNEED part that isn't holding the mmap_sem for writing unfortunately wasn't sorted out immediately, that was unexpected in fact. pmd_trans_unstable() was introduced precisely to handle this trouble caused by MADV_DONTNEED running with the mmap_sem only for reading which causes infinite possible transactions back and forth between none and transhuge while holding only the mmap_sem for reading. == >From eae4f251604299082dd824dc8acade71268c8d88 Mon Sep 17 00:00:00 2001 From: Andrea Arcangeli <aarcange@xxxxxxxxxx> Date: Tue, 23 Feb 2016 18:56:55 +0100 Subject: [PATCH 1/1] mm: thp: fix SMP race condition between THP page fault and MADV_DONTNEED pmd_trans_unstable/pmd_none_or_trans_huge_or_clear_bad were introduced to locklessy (but atomically) detect when a pmd is a regular (stable) pmd or when the pmd is unstable and can infinitely transition from pmd_none and pmd_trans_huge from under us, while only holding the mmap_sem for reading (for writing not). While holding the mmap_sem only for reading, MADV_DONTNEED can run from under us and so before we can threat the pmd as regular we need to compare it against pmd_none and pmd_trans_huge in an atomic way, with pmd_trans_unstable(). The old pmd_trans_huge check is correct but it leaves a tiny window for a race. Useful applications are unlikely to notice the difference as doing MADV_DONTNEED concurrently with a page fault would lead to undefined behavior. Signed-off-by: Andrea Arcangeli <aarcange@xxxxxxxxxx> Reported-by: Kirill A. Shutemov <kirill.shutemov@xxxxxxxxxxxxxxx> --- mm/memory.c | 15 +++++++++++++-- 1 file changed, 13 insertions(+), 2 deletions(-) diff --git a/mm/memory.c b/mm/memory.c index 635451a..d5912b0 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -3404,8 +3404,19 @@ static int __handle_mm_fault(struct mm_struct *mm, struct vm_area_struct *vma, if (unlikely(pmd_none(*pmd)) && unlikely(__pte_alloc(mm, vma, pmd, address))) return VM_FAULT_OOM; - /* if an huge pmd materialized from under us just retry later */ - if (unlikely(pmd_trans_huge(*pmd) || pmd_devmap(*pmd))) + /* + * If an huge pmd materialized from under us just retry later. + * Use pmd_trans_unstable() instead of pmd_trans_huge() to + * ensure the pmd didn't become pmd_trans_huge from under us + * and then immediately back to pmd_none as result of + * MADV_DONTNEED running immediately after a huge_pmd fault of + * a different thread of this mm, in turn leading to a false + * negative pmd_trans_huge() retval. All we have to ensure is + * that it is a regular pmd that we can walk with + * pte_offset_map() and we can do that through an atomic read + * in C, which is what pmd_trans_unstable() is provided for. + */ + if (unlikely(pmd_trans_unstable(*pmd) || pmd_devmap(*pmd))) return 0; /* * A regular pmd is established and it can't morph into a huge pmd -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@xxxxxxxxx. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@xxxxxxxxx"> email@xxxxxxxxx </a>