On 8/5/19 1:57 AM, Michal Hocko wrote: > On Fri 02-08-19 10:42:33, Mike Kravetz wrote: >> On 8/1/19 9:15 PM, Naoya Horiguchi wrote: >>> On Thu, Aug 01, 2019 at 05:19:41PM -0700, Mike Kravetz wrote: >>>> There appears to be a race with hugetlb_fault and try_to_unmap_one of >>>> the migration path. >>>> >>>> Can you try this patch in your environment? I am not sure if it will >>>> be the final fix, but just wanted to see if it addresses issue for you. >>>> >>>> diff --git a/mm/hugetlb.c b/mm/hugetlb.c >>>> index ede7e7f5d1ab..f3156c5432e3 100644 >>>> --- a/mm/hugetlb.c >>>> +++ b/mm/hugetlb.c >>>> @@ -3856,6 +3856,20 @@ static vm_fault_t hugetlb_no_page(struct mm_struct *mm, >>>> >>>> page = alloc_huge_page(vma, haddr, 0); >>>> if (IS_ERR(page)) { >>>> + /* >>>> + * We could race with page migration (try_to_unmap_one) >>>> + * which is modifying page table with lock. However, >>>> + * we are not holding lock here. Before returning >>>> + * error that will SIGBUS caller, get ptl and make >>>> + * sure there really is no entry. >>>> + */ >>>> + ptl = huge_pte_lock(h, mm, ptep); >>>> + if (!huge_pte_none(huge_ptep_get(ptep))) { >>>> + ret = 0; >>>> + spin_unlock(ptl); >>>> + goto out; >>>> + } >>>> + spin_unlock(ptl); >>> >>> Thanks you for investigation, Mike. >>> I tried this change and found no SIGBUS, so it works well. >>> >>> I'm still not clear about how !huge_pte_none() becomes true here, >>> because we enter hugetlb_no_page() only when huge_pte_none() is non-null >>> and (racy) try_to_unmap_one() from page migration should convert the >>> huge_pte into a migration entry, not null. >> >> Thanks for taking a look Naoya. >> >> In try_to_unmap_one(), there is this code block: >> >> /* Nuke the page table entry. */ >> flush_cache_page(vma, address, pte_pfn(*pvmw.pte)); >> if (should_defer_flush(mm, flags)) { >> /* >> * We clear the PTE but do not flush so potentially >> * a remote CPU could still be writing to the page. >> * If the entry was previously clean then the >> * architecture must guarantee that a clear->dirty >> * transition on a cached TLB entry is written through >> * and traps if the PTE is unmapped. >> */ >> pteval = ptep_get_and_clear(mm, address, pvmw.pte); >> >> set_tlb_ubc_flush_pending(mm, pte_dirty(pteval)); >> } else { >> pteval = ptep_clear_flush(vma, address, pvmw.pte); >> } >> >> That happens before setting the migration entry. Therefore, for a period >> of time the pte is NULL (huge_pte_none() returns true). >> >> try_to_unmap_one holds the page table lock, but hugetlb_fault does not take >> the lock to 'optimistically' check huge_pte_none(). When huge_pte_none >> returns true, it calls hugetlb_no_page which is where we try to allocate >> a page and fails. >> >> Does that make sense, or am I missing something? >> >> The patch checks for this specific condition: someone changing the pte >> from NULL to non-NULL while holding the lock. I am not sure if this is >> the best way to fix. But, it may be the easiest. > > Please add a comment to explain this because this is quite subtle and > tricky. Unlike the regular page fault hugetlb_no_page is protected by a > large lock so a retry check seems unexpected. Will do. Fixing up hugetlbfs locking is still 'on my list'. There are known issues. The last RFC/attempt was this: http://lkml.kernel.org/r/20190201221705.15622-1-mike.kravetz@xxxxxxxxxx I believe that patch would have handled this issue. However, as mentioned above it may better to just patch this issue exposed by LTP and work on the more comprehensive change in the background. -- Mike Kravetz