On Wed, Sep 20, 2023 at 3:49 AM Suren Baghdasaryan <surenb@xxxxxxxxxx> wrote: > On Tue, Sep 19, 2023 at 4:51 PM Jann Horn <jannh@xxxxxxxxxx> wrote: > > On Wed, Sep 20, 2023 at 1:08 AM Suren Baghdasaryan <surenb@xxxxxxxxxx> wrote: > > > On Thu, Sep 14, 2023 at 7:28 PM Jann Horn <jannh@xxxxxxxxxx> wrote: > > > > On Thu, Sep 14, 2023 at 5:26 PM Suren Baghdasaryan <surenb@xxxxxxxxxx> wrote: > > > > > From: Andrea Arcangeli <aarcange@xxxxxxxxxx> > > > > > > > > > > This implements the uABI of UFFDIO_REMAP. > > > > > > > > > > Notably one mode bitflag is also forwarded (and in turn known) by the > > > > > lowlevel remap_pages method. > > [...] > > > > > diff --git a/mm/huge_memory.c b/mm/huge_memory.c > > [...] > > > > > +int remap_pages_huge_pmd(struct mm_struct *dst_mm, > > > > > + struct mm_struct *src_mm, > > > > > + pmd_t *dst_pmd, pmd_t *src_pmd, > > > > > + pmd_t dst_pmdval, > > > > > + struct vm_area_struct *dst_vma, > > > > > + struct vm_area_struct *src_vma, > > > > > + unsigned long dst_addr, > > > > > + unsigned long src_addr) > > > > > +{ > > > > > + pmd_t _dst_pmd, src_pmdval; > > > > > + struct page *src_page; > > > > > + struct anon_vma *src_anon_vma, *dst_anon_vma; > > > > > + spinlock_t *src_ptl, *dst_ptl; > > > > > + pgtable_t pgtable; > > > > > + struct mmu_notifier_range range; > > > > > + > > > > > + src_pmdval = *src_pmd; > > > > > + src_ptl = pmd_lockptr(src_mm, src_pmd); > > > > > + > > > > > + BUG_ON(!pmd_trans_huge(src_pmdval)); > > > > > + BUG_ON(!pmd_none(dst_pmdval)); > > > > > > > > Why can we assert that pmd_none(dst_pmdval) is true here? Can we not > > > > have concurrent faults (or userfaultfd operations) populating that > > > > PMD? > > > > > > IIUC dst_pmdval is a copy of the value from dst_pmd, so that local > > > copy should not change even if some concurrent operation changes > > > dst_pmd. We can assert that it's pmd_none because we checked for that > > > before calling remap_pages_huge_pmd. Later on we check if dst_pmd > > > changed from under us (see pmd_same(*dst_pmd, dst_pmdval) check) and > > > retry if that happened. > > > > Oh, right, I don't know what I was thinking when I typed that. > > > > But now I wonder about the check directly above that: What does this > > code do for swap PMDs? It looks like that might splat on the > > BUG_ON(!pmd_trans_huge(src_pmdval)). All we've checked on the path to > > here is that the virtual memory area is aligned, that the destination > > PMD is empty, and that pmd_trans_huge_lock() succeeded; but > > pmd_trans_huge_lock() explicitly permits swap PMDs (which is the > > swapped-out version of transhuge PMDs): > > > > static inline spinlock_t *pmd_trans_huge_lock(pmd_t *pmd, > > struct vm_area_struct *vma) > > { > > if (is_swap_pmd(*pmd) || pmd_trans_huge(*pmd) || pmd_devmap(*pmd)) > > return __pmd_trans_huge_lock(pmd, vma); > > else > > return NULL; > > } > > Yeah... Ok, I think I'm missing a check for pmd_trans_huge(*src_pmd) > after we lock it with pmd_trans_huge_lock(src_pmd, src_vma). And we > can remove the above BUG_ON(). Would that address your concern? Sounds good. It'll end up splitting huge swap entries but I guess the extra code for moving huge swap entries might not be worth it.