On Fri, Aug 02, 2019 at 10:42:33AM -0700, Mike Kravetz wrote:
> On 8/1/19 9:15 PM, Naoya Horiguchi wrote:
> > On Thu, Aug 01, 2019 at 05:19:41PM -0700, Mike Kravetz wrote:
> >> There appears to be a race with hugetlb_fault and try_to_unmap_one of
> >> the migration path.
> >>
> >> Can you try this patch in your environment?  I am not sure if it will
> >> be the final fix, but just wanted to see if it addresses the issue for you.
> >>
> >> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> >> index ede7e7f5d1ab..f3156c5432e3 100644
> >> --- a/mm/hugetlb.c
> >> +++ b/mm/hugetlb.c
> >> @@ -3856,6 +3856,20 @@ static vm_fault_t hugetlb_no_page(struct mm_struct *mm,
> >>
> >>  		page = alloc_huge_page(vma, haddr, 0);
> >>  		if (IS_ERR(page)) {
> >> +			/*
> >> +			 * We could race with page migration (try_to_unmap_one)
> >> +			 * which is modifying page table with lock.  However,
> >> +			 * we are not holding lock here.  Before returning
> >> +			 * error that will SIGBUS caller, get ptl and make
> >> +			 * sure there really is no entry.
> >> +			 */
> >> +			ptl = huge_pte_lock(h, mm, ptep);
> >> +			if (!huge_pte_none(huge_ptep_get(ptep))) {
> >> +				ret = 0;
> >> +				spin_unlock(ptl);
> >> +				goto out;
> >> +			}
> >> +			spin_unlock(ptl);
> >
> > Thank you for the investigation, Mike.
> > I tried this change and found no SIGBUS, so it works well.
> >
> > I'm still not clear about how !huge_pte_none() becomes true here,
> > because we enter hugetlb_no_page() only when huge_pte_none() returns true
> > and (racy) try_to_unmap_one() from page migration should convert the
> > huge_pte into a migration entry, not null.
>
> Thanks for taking a look, Naoya.
>
> In try_to_unmap_one(), there is this code block:
>
> 		/* Nuke the page table entry. */
> 		flush_cache_page(vma, address, pte_pfn(*pvmw.pte));
> 		if (should_defer_flush(mm, flags)) {
> 			/*
> 			 * We clear the PTE but do not flush so potentially
> 			 * a remote CPU could still be writing to the page.
> 			 * If the entry was previously clean then the
> 			 * architecture must guarantee that a clear->dirty
> 			 * transition on a cached TLB entry is written through
> 			 * and traps if the PTE is unmapped.
> 			 */
> 			pteval = ptep_get_and_clear(mm, address, pvmw.pte);
>
> 			set_tlb_ubc_flush_pending(mm, pte_dirty(pteval));
> 		} else {
> 			pteval = ptep_clear_flush(vma, address, pvmw.pte);
> 		}
>
> That happens before setting the migration entry.  Therefore, for a period
> of time the pte is NULL (huge_pte_none() returns true).
>
> try_to_unmap_one holds the page table lock, but hugetlb_fault does not take
> the lock to 'optimistically' check huge_pte_none().  When huge_pte_none
> returns true, it calls hugetlb_no_page, which is where we try to allocate
> a page and the allocation fails.
>
> Does that make sense, or am I missing something?

Makes sense to me, thanks.

>
> The patch checks for this specific condition: someone changing the pte
> from NULL to non-NULL while holding the lock.  I am not sure if this is
> the best way to fix.  But, it may be the easiest.

Yes, I think so.

- Naoya
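
For reference, the interleaving described above can be replayed in a small
userspace model. This is only a sketch under stated assumptions, not kernel
code: every name in it (pte_slot, slot_lock, simulate_race, and so on) is
hypothetical, the page table lock is modelled as a plain flag, and the two
paths are run step by step in a single thread so the ordering is
deterministic. It only illustrates why rechecking the entry under the lock
turns the spurious SIGBUS into a retry.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of one huge PTE slot; not a kernel structure. */
enum pte_state { PTE_NONE, PTE_PRESENT, PTE_MIGRATION };
enum fault_result { FAULT_RETRY, FAULT_SIGBUS };

struct pte_slot {
	bool locked;		/* stands in for the page table lock */
	enum pte_state state;
};

static void slot_lock(struct pte_slot *s)
{
	assert(!s->locked);	/* single-threaded replay: never contended */
	s->locked = true;
}

static void slot_unlock(struct pte_slot *s)
{
	assert(s->locked);
	s->locked = false;
}

/*
 * Fault path after the page allocation has failed. With recheck set, it
 * mirrors the idea of the patch: take the lock and only report SIGBUS
 * if the entry is really still empty.
 */
static enum fault_result fault_alloc_failed(struct pte_slot *s, bool recheck)
{
	enum fault_result ret = FAULT_SIGBUS;

	if (!recheck)
		return ret;

	slot_lock(s);
	if (s->state != PTE_NONE)
		ret = FAULT_RETRY;	/* someone filled the entry: retry */
	slot_unlock(s);
	return ret;
}

/* Replays the racy ordering and returns what the faulting task sees. */
enum fault_result simulate_race(bool recheck)
{
	struct pte_slot slot = { false, PTE_PRESENT };
	bool saw_none;

	/* Migration side: take the lock and clear the entry. */
	slot_lock(&slot);
	slot.state = PTE_NONE;

	/* Fault side: optimistic check without the lock sees "none". */
	saw_none = (slot.state == PTE_NONE);
	assert(saw_none);

	/* Migration side: install the migration entry, drop the lock. */
	slot.state = PTE_MIGRATION;
	slot_unlock(&slot);

	/* Fault side: allocation fails after the entry was installed. */
	return fault_alloc_failed(&slot, recheck);
}
```

Without the recheck the model reports FAULT_SIGBUS even though a migration
entry is present by then; with the recheck it reports FAULT_RETRY, which
corresponds to returning 0 from the fault so the access is retried.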