On Wed, Sep 20, 2023 at 1:08 AM Suren Baghdasaryan <surenb@xxxxxxxxxx> wrote: > On Thu, Sep 14, 2023 at 7:28 PM Jann Horn <jannh@xxxxxxxxxx> wrote: > > On Thu, Sep 14, 2023 at 5:26 PM Suren Baghdasaryan <surenb@xxxxxxxxxx> wrote: > > > From: Andrea Arcangeli <aarcange@xxxxxxxxxx> > > > > > > This implements the uABI of UFFDIO_REMAP. > > > > > > Notably one mode bitflag is also forwarded (and in turn known) by the > > > lowlevel remap_pages method. [...] > > > diff --git a/mm/huge_memory.c b/mm/huge_memory.c [...] > > > +int remap_pages_huge_pmd(struct mm_struct *dst_mm, > > > + struct mm_struct *src_mm, > > > + pmd_t *dst_pmd, pmd_t *src_pmd, > > > + pmd_t dst_pmdval, > > > + struct vm_area_struct *dst_vma, > > > + struct vm_area_struct *src_vma, > > > + unsigned long dst_addr, > > > + unsigned long src_addr) > > > +{ > > > + pmd_t _dst_pmd, src_pmdval; > > > + struct page *src_page; > > > + struct anon_vma *src_anon_vma, *dst_anon_vma; > > > + spinlock_t *src_ptl, *dst_ptl; > > > + pgtable_t pgtable; > > > + struct mmu_notifier_range range; > > > + > > > + src_pmdval = *src_pmd; > > > + src_ptl = pmd_lockptr(src_mm, src_pmd); > > > + > > > + BUG_ON(!pmd_trans_huge(src_pmdval)); > > > + BUG_ON(!pmd_none(dst_pmdval)); > > > > Why can we assert that pmd_none(dst_pmdval) is true here? Can we not > > have concurrent faults (or userfaultfd operations) populating that > > PMD? > > IIUC dst_pmdval is a copy of the value from dst_pmd, so that local > copy should not change even if some concurrent operation changes > dst_pmd. We can assert that it's pmd_none because we checked for that > before calling remap_pages_huge_pmd. Later on we check if dst_pmd > changed from under us (see pmd_same(*dst_pmd, dst_pmdval) check) and > retry if that happened. Oh, right, I don't know what I was thinking when I typed that. But now I wonder about the check directly above that: What does this code do for swap PMDs? It looks like that might splat on the BUG_ON(!pmd_trans_huge(src_pmdval)). All we've checked on the path to here is that the virtual memory area is aligned, that the destination PMD is empty, and that pmd_trans_huge_lock() succeeded; but pmd_trans_huge_lock() explicitly permits swap PMDs (which is the swapped-out version of transhuge PMDs): static inline spinlock_t *pmd_trans_huge_lock(pmd_t *pmd, struct vm_area_struct *vma) { if (is_swap_pmd(*pmd) || pmd_trans_huge(*pmd) || pmd_devmap(*pmd)) return __pmd_trans_huge_lock(pmd, vma); else return NULL; } > > > > > + BUG_ON(!spin_is_locked(src_ptl)); > > > + mmap_assert_locked(src_mm); > > > + mmap_assert_locked(dst_mm); > > > + BUG_ON(src_addr & ~HPAGE_PMD_MASK); > > > + BUG_ON(dst_addr & ~HPAGE_PMD_MASK); > > > + > > > + src_page = pmd_page(src_pmdval); > > > + BUG_ON(!PageHead(src_page)); > > > + BUG_ON(!PageAnon(src_page)); > > > + if (unlikely(page_mapcount(src_page) != 1)) { > > > + spin_unlock(src_ptl); > > > + return -EBUSY; > > > + } > > > + > > > + get_page(src_page); > > > + spin_unlock(src_ptl); > > > + > > > + mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, src_mm, src_addr, > > > + src_addr + HPAGE_PMD_SIZE); > > > + mmu_notifier_invalidate_range_start(&range); > > > + > > > + /* block all concurrent rmap walks */ > > > + lock_page(src_page); > > > + > > > + /* > > > + * split_huge_page walks the anon_vma chain without the page > > > + * lock. Serialize against it with the anon_vma lock, the page > > > + * lock is not enough. > > > + */ > > > + src_anon_vma = folio_get_anon_vma(page_folio(src_page)); > > > + if (!src_anon_vma) { > > > + unlock_page(src_page); > > > + put_page(src_page); > > > + mmu_notifier_invalidate_range_end(&range); > > > + return -EAGAIN; > > > + } > > > + anon_vma_lock_write(src_anon_vma); > > > + > > > + dst_ptl = pmd_lockptr(dst_mm, dst_pmd); > > > + double_pt_lock(src_ptl, dst_ptl); > > > + if (unlikely(!pmd_same(*src_pmd, src_pmdval) || > > > + !pmd_same(*dst_pmd, dst_pmdval) || > > > + page_mapcount(src_page) != 1)) { > > > + double_pt_unlock(src_ptl, dst_ptl); > > > + anon_vma_unlock_write(src_anon_vma); > > > + put_anon_vma(src_anon_vma); > > > + unlock_page(src_page); > > > + put_page(src_page); > > > + mmu_notifier_invalidate_range_end(&range); > > > + return -EAGAIN; > > > + } > > > + > > > + BUG_ON(!PageHead(src_page)); > > > + BUG_ON(!PageAnon(src_page)); > > > + /* the PT lock is enough to keep the page pinned now */ > > > + put_page(src_page); > > > + > > > + dst_anon_vma = (void *) dst_vma->anon_vma + PAGE_MAPPING_ANON; > > > + WRITE_ONCE(src_page->mapping, (struct address_space *) dst_anon_vma); > > > + WRITE_ONCE(src_page->index, linear_page_index(dst_vma, dst_addr)); > > > + > > > + if (!pmd_same(pmdp_huge_clear_flush(src_vma, src_addr, src_pmd), > > > + src_pmdval)) > > > + BUG_ON(1); > > > > I'm not sure we can assert that the PMDs are exactly equal; the CPU > > might have changed the A/D bits under us? > > Yes. I wonder if I can simply remove the BUG_ON here like this: > > src_pmdval = pmdp_huge_clear_flush(src_vma, src_addr, src_pmd); > > Technically we don't use src_pmdval after this but for the possible > future use that would keep things correct. If A/D bits changed from > under us we will still copy correct values into dst_pmd. And when we set up the dst_pmd, we always mark it as dirty and accessed... so I guess that's fine. > > > + _dst_pmd = mk_huge_pmd(src_page, dst_vma->vm_page_prot); > > > + _dst_pmd = maybe_pmd_mkwrite(pmd_mkdirty(_dst_pmd), dst_vma); > > > + set_pmd_at(dst_mm, dst_addr, dst_pmd, _dst_pmd); > > > + > > > + pgtable = pgtable_trans_huge_withdraw(src_mm, src_pmd); > > > + pgtable_trans_huge_deposit(dst_mm, dst_pmd, pgtable); > > > > Are we allowed to move page tables between mm_structs on all > > architectures? The first example I found that looks a bit dodgy, > > looking through various architectures' pte_alloc_one(), is s390's > > page_table_alloc() which looks like page tables are tied to per-MM > > lists sometimes. > > If that's not allowed, we might have to allocate a new deposit table > > and free the old one or something like that. > > Hmm. Yeah, looks like in the case of !CONFIG_PGSTE the table can be > linked to mm->context.pgtable_list, so can't be moved to another mm. I > guess I'll have to keep a pgtable allocated, ready to be deposited and > free the old one. Maybe it's worth having an arch-specific function > indicating whether moving a pgtable between MMs is supported? Or do it > separately as an optimization. WDYT? Hm, dunno. I guess you could have architectures opt in with some config flag similar to how flags like ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH are wired up - define it in init/Kconfig, select it in the architectures that support it, and then gate the fast version on that with #ifdef? > > > + if (dst_mm != src_mm) { > > > + add_mm_counter(dst_mm, MM_ANONPAGES, HPAGE_PMD_NR); > > > + add_mm_counter(src_mm, MM_ANONPAGES, -HPAGE_PMD_NR); > > > + } > > > + double_pt_unlock(src_ptl, dst_ptl); > > > + > > > + anon_vma_unlock_write(src_anon_vma); > > > + put_anon_vma(src_anon_vma); > > > + > > > + /* unblock rmap walks */ > > > + unlock_page(src_page); > > > + > > > + mmu_notifier_invalidate_range_end(&range); > > > + return 0; > > > +} > > > +#endif /* CONFIG_USERFAULTFD */ > > > + > > > /* > > > * Returns page table lock pointer if a given pmd maps a thp, NULL otherwise. > > > * > > [...] > > > diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c > > > index 96d9eae5c7cc..0cca60dfa8f8 100644 > > > --- a/mm/userfaultfd.c > > > +++ b/mm/userfaultfd.c > > [...] > > > +ssize_t remap_pages(struct mm_struct *dst_mm, struct mm_struct *src_mm, > > > + unsigned long dst_start, unsigned long src_start, > > > + unsigned long len, __u64 mode) > > > +{ > > [...] > > > + > > > + if (pgprot_val(src_vma->vm_page_prot) != > > > + pgprot_val(dst_vma->vm_page_prot)) > > > + goto out; > > > > Does this check intentionally allow moving pages from a > > PROT_READ|PROT_WRITE anonymous private VMA into a PROT_READ anonymous > > private VMA (on architectures like x86 and arm64 where CoW memory has > > the same protection flags as read-only memory), but forbid moving them > > from a PROT_READ|PROT_EXEC VMA into a PROT_READ VMA? I think this > > check needs at least a comment to explain what's going on here. > > The check is simply to ensure the VMAs have the same access > permissions to prevent page copies that should have different > permissions. The fact that x86 and arm64 have the same protection bits > for R/O and COW memory is a "side-effect" IMHO. I'm not sure what > comment would be good here but I'm open to suggestions. I'm not sure if you can do a meaningful security check on the ->vm_page_prot. I also don't think it matters for anonymous VMAs. I guess if you want to keep this check but make this behavior more consistent, you could put another check in front of this that rejects VMAs where vm_flags like VM_READ, VM_WRITE, VM_SHARED or VM_EXEC are different? [...] > > > + /* > > > + * Ensure the dst_vma has a anon_vma or this page > > > + * would get a NULL anon_vma when moved in the > > > + * dst_vma. > > > + */ > > > + err = -ENOMEM; > > > + if (unlikely(anon_vma_prepare(dst_vma))) > > > + goto out; > > > + > > > + for (src_addr = src_start, dst_addr = dst_start; > > > + src_addr < src_start + len;) { > > > + spinlock_t *ptl; > > > + pmd_t dst_pmdval; > > > + > > > + BUG_ON(dst_addr >= dst_start + len); > > > + src_pmd = mm_find_pmd(src_mm, src_addr); > > > > (this would blow up pretty badly if we could have transparent huge PUD > > in the region but I think that's limited to file VMAs so it's fine as > > it currently is) > > Should I add a comment here as a warning if in the future we decide to > implement support for file-backed pages? Hm, yeah, I guess that might be a good idea.