On Mon, 22 May 2023, Qi Zheng wrote:
> On 2023/5/22 13:26, Hugh Dickins wrote:
> > handle_pte_fault() use pte_offset_map_nolock() to get the vmf.ptl which
> > corresponds to vmf.pte, instead of pte_lockptr() being used later, when
> > there's a chance that the pmd entry might have changed, perhaps to none,
> > or to a huge pmd, with no split ptlock in its struct page.
> > 
> > Remove its pmd_devmap_trans_unstable() call: pte_offset_map_nolock()
> > will handle that case by failing.  Update the "morph" comment above,
> > looking forward to when shmem or file collapse to THP may not take
> > mmap_lock for write (or not at all).
> > 
> > do_numa_page() use the vmf->ptl from handle_pte_fault() at first, but
> > refresh it when refreshing vmf->pte.
> > 
> > do_swap_page()'s pte_unmap_same() (the thing that takes ptl to verify a
> > two-part PAE orig_pte) use the vmf->ptl from handle_pte_fault() too; but
> > do_swap_page() is also used by anon THP's __collapse_huge_page_swapin(),
> > so adjust that to set vmf->ptl by pte_offset_map_nolock().
> > 
> > Signed-off-by: Hugh Dickins <hughd@xxxxxxxxxx>
> > ---
> >  mm/khugepaged.c |  6 ++++--
> >  mm/memory.c     | 38 +++++++++++++-------------------------
> >  2 files changed, 17 insertions(+), 27 deletions(-)
> > ...
> > diff --git a/mm/memory.c b/mm/memory.c
> > index c7b920291a72..4ec46eecefd3 100644
> > --- a/mm/memory.c
> > +++ b/mm/memory.c
...
> > @@ -4897,27 +4897,16 @@ static vm_fault_t handle_pte_fault(struct vm_fault *vmf)
> >  		vmf->pte = NULL;
> >  		vmf->flags &= ~FAULT_FLAG_ORIG_PTE_VALID;
> >  	} else {
> > -		/*
> > -		 * If a huge pmd materialized under us just retry later.  Use
> > -		 * pmd_trans_unstable() via pmd_devmap_trans_unstable() instead
> > -		 * of pmd_trans_huge() to ensure the pmd didn't become
> > -		 * pmd_trans_huge under us and then back to pmd_none, as a
> > -		 * result of MADV_DONTNEED running immediately after a huge pmd
> > -		 * fault in a different thread of this mm, in turn leading to a
> > -		 * misleading pmd_trans_huge() retval.  All we have to ensure is
> > -		 * that it is a regular pmd that we can walk with
> > -		 * pte_offset_map() and we can do that through an atomic read
> > -		 * in C, which is what pmd_trans_unstable() provides.
> > -		 */
> > -		if (pmd_devmap_trans_unstable(vmf->pmd))
> > -			return 0;
> >  		/*
> >  		 * A regular pmd is established and it can't morph into a huge
> > -		 * pmd from under us anymore at this point because we hold the
> > -		 * mmap_lock read mode and khugepaged takes it in write mode.
> > -		 * So now it's safe to run pte_offset_map().
> > +		 * pmd by anon khugepaged, since that takes mmap_lock in write
> > +		 * mode; but shmem or file collapse to THP could still morph
> > +		 * it into a huge pmd: just retry later if so.
> >  		 */
> > -		vmf->pte = pte_offset_map(vmf->pmd, vmf->address);
> > +		vmf->pte = pte_offset_map_nolock(vmf->vma->vm_mm, vmf->pmd,
> > +						 vmf->address, &vmf->ptl);
> > +		if (unlikely(!vmf->pte))
> > +			return 0;
> 
> Just jump to the retry label below?

Shrug.  Could do.  But again I saw no reason to optimize this path:
the pmd_devmap_trans_unstable() treatment sets a good enough example.

Hugh
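
[ For anyone skimming the thread, here is a rough sketch of the pattern
  being discussed, pieced together from the hunk quoted above rather than
  copied from any one caller; mm, pmd, addr, pte and ptl are just
  stand-in local names:

	/* Map the pte and learn which ptl guards it, without locking. */
	pte = pte_offset_map_nolock(mm, pmd, addr, &ptl);
	if (unlikely(!pte))
		return 0;	/* pmd changed underneath: let the fault be retried */
	spin_lock(ptl);		/* pte_offset_map_nolock() did not take the lock for us */
	/* ... revalidate and operate on *pte while holding ptl ... */
	pte_unmap_unlock(pte, ptl);

  Returning 0 on failure simply reports the fault as handled and relies
  on the access being retried, the same "just retry later" treatment the
  old pmd_devmap_trans_unstable() check used. ]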