Thanks for the early review :) I gather you're not coming to LSF? That's a huge pity, I would have loved to see you there. But this is much appreciated! :)

On Sat, Mar 22, 2025 at 01:14:39AM +0100, Jann Horn wrote:
> On Fri, Mar 21, 2025 at 10:54 PM Lorenzo Stoakes
> <lorenzo.stoakes@xxxxxxxxxx> wrote:
> > diff --git a/mm/mremap.c b/mm/mremap.c
> > index 0865387531ed..bb67562a0114 100644
> > --- a/mm/mremap.c
> > +++ b/mm/mremap.c
> [...]
> > +/*
> > + * If the folio mapped at the specified pte entry can have its index and mapping
> > + * relocated, then do so.
> > + *
> > + * Returns the number of pages we have traversed, or 0 if the operation failed.
> > + */
> > +static unsigned long relocate_anon(struct pagetable_move_control *pmc,
> > +                unsigned long old_addr, unsigned long new_addr, pte_t pte,
> > +                bool undo)
> > +{
> > +        struct page *page;
> > +        struct folio *folio;
> > +        struct vm_area_struct *old, *new;
> > +        pgoff_t new_index;
> > +        unsigned long ret = 1;
> > +
> > +        old = pmc->old;
> > +        new = pmc->new;
> > +
> > +        /* Ensure we have truly got an anon folio. */
> > +        page = vm_normal_page(old, old_addr, pte);
> > +        if (!page)
> > +                return ret;
> > +        folio = page_folio(page);
> > +        folio_lock(folio);
> > +
> > +        /* no-op. */
> > +        if (!folio_test_anon(folio) || folio_test_ksm(folio))
> > +                goto out;
> > +
> > +        /*
> > +         * This should not happen as we explicitly disallow this, but check
> > +         * anyway.
> > +         */
> > +        if (folio_test_large(folio)) {
> > +                ret = 0;
> > +                goto out;
> > +        }
>
> Do I understand correctly that you assume here that the page is
> exclusively mapped? Maybe we could at least
> WARN_ON(folio_mapcount(folio) != 1) or something like that?

Ack and agreed, will add!

>
> (I was also wondering if the PageAnonExclusive bit is somehow
> relevant, but we should probably not look at or touch that here,
> unless we want to think about cases where we _used to_ have a child
> from which the page may have been GUP'd...)

Yeah I was thinking the same re: this flag, but given the locks we hold this should not be the case. However, as you note later, I do need to check anon_vma->num_children == 1 (i.e. self-parented) to ensure that we aren't looking at a mapping in a parent which has a child, which would imply a folio that maybe isn't exclusive to us.

>
> > +        if (!undo)
> > +                new_index = linear_page_index(new, new_addr);
> > +        else
> > +                new_index = linear_page_index(old, old_addr);
> > +
> > +        /*
> > +         * The PTL should keep us safe from unmapping, and the fact the folio is
> > +         * a PTE keeps the folio referenced.
> > +         *
> > +         * The mmap/VMA locks should keep us safe from fork and other processes.
> > +         *
> > +         * The rmap locks should keep us safe from anything happening to the
> > +         * VMA/anon_vma.
>
> "The rmap locks"? I only see that we're holding the rmap lock on the
> new anon_vma; are we also holding a lock on the old anon_vma?

I should rephrase (this is partly because I kept changing how I did the locking - hey, I did warn in the preamble that this is early :P)

I don't believe we need to hold the _source_ rmap lock, because we take the folio lock prior to adjusting the folio, so it shouldn't be possible for an rmap walk to go terribly wrong here.

But do correct me if you feel this is an invalid assumption.

>
> > +         * The folio lock should keep us safe from reclaim, migration, etc.
> > +         */
> > +        folio_move_anon_rmap(folio, undo ? old : new);
> > +        WRITE_ONCE(folio->index, new_index);
> > +
> > +out:
> > +        folio_unlock(folio);
> > +        return ret;
> > +}
> [...]
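
To be concrete about the mapcount check - I'm thinking of something roughly like the below (untested sketch, placed just after the large folio check; exact placement, and whether to warn or just silently bail, TBD):

        /*
         * We assume the folio is mapped exactly once - the mmap/VMA write
         * locks should prevent new mappings appearing while we do this -
         * so warn and abort the relocation if that ever doesn't hold.
         */
        if (WARN_ON_ONCE(folio_mapcount(folio) != 1)) {
                ret = 0;
                goto out;
        }
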
> > +static bool relocate_anon_ptes(struct pagetable_move_control *pmc,
> > +                unsigned long extent, pmd_t *pmd, bool undo)
> > +{
> > +        struct mm_struct *mm = current->mm;
> > +        struct pte_state state = {
> > +                .old_addr = pmc->old_addr,
> > +                .new_addr = pmc->new_addr,
> > +                .old_end = pmc->old_addr + extent,
> > +        };
> > +        spinlock_t *ptl;
> > +        pte_t *ptep_start;
> > +        bool ret;
> > +        unsigned long nr_pages;
> > +
> > +        ptep_start = pte_offset_map_lock(mm, pmd, pmc->old_addr, &ptl);
> > +        /*
> > +         * We prevent faults with mmap write lock, hold the rmap lock and should
> > +         * not fail to obtain this lock. Just give up if we can't.
> > +         */
> > +        if (!ptep_start)
> > +                return false;
> > +
> > +        state.ptep = ptep_start;
> > +
> > +        for (; !pte_done(&state); pte_next(&state, nr_pages)) {
> > +                pte_t pte = ptep_get(state.ptep);
> > +
> > +                if (pte_none(pte) || !pte_present(pte)) {
> > +                        nr_pages = 1;
> > +                        continue;
> > +                }
> > +
> > +                nr_pages = relocate_anon(pmc, state.old_addr, state.new_addr, pte, undo);
> > +                if (!nr_pages) {
> > +                        ret = false;
> > +                        goto out;
> > +                }
> > +        }
> > +
> > +        ret = true;
> > +out:
> > +        pte_unmap_unlock(ptep_start, ptl);
> > +        return ret;
> > +}
>
> Just to make sure I understand correctly:
> This function is changing the ->pgoff and ->mapping of pages while
> they are still mapped in the old VMA, right? And if that fails midway
> through for whatever reason, we go and change all the already-changed
> ->pgoff and ->mapping values back?

Yup. The write lock is held on both VMAs, and the rmap lock for the new VMA.

>
> [...]
> > @@ -1132,6 +1347,42 @@ static void unmap_source_vma(struct vma_remap_struct *vrm)
> >  }
> >  }
> >
> > +/*
> > + * Should we attempt to relocate anonymous folios to the location that the VMA
> > + * is being moved to by updating index and mapping fields accordingly?
> > + */
> > +static bool should_relocate_anon(struct vma_remap_struct *vrm,
> > +                struct pagetable_move_control *pmc)
> > +{
> > +        struct vm_area_struct *old = vrm->vma;
> > +
> > +        /* Currently we only do this if requested. */
> > +        if (!(vrm->flags & MREMAP_RELOCATE_ANON))
> > +                return false;
> > +
> > +        /* We can't deal with special or hugetlb mappings. */
> > +        if (old->vm_flags & (VM_SPECIAL | VM_HUGETLB))
> > +                return false;
> > +
> > +        /* We only support anonymous mappings. */
> > +        if (!vma_is_anonymous(old))
> > +                return false;
> > +
> > +        /* If no folios are mapped, then no need to attempt this. */
> > +        if (!old->anon_vma)
> > +                return false;
> > +
> > +        /*
> > +         * If the old VMA is a child (i.e. has been forked), then the index
> > +         * references multiple VMAs, we have to bail.
> > +         */
> > +        if (!list_is_singular(&old->anon_vma_chain))
> > +                return false;
>
> I think this is wrong: list_is_singular(&old->anon_vma_chain) tells
> you whether pages in the VMA might be shared due to this mm_struct
> being forked from a parent mm_struct; but it won't tell you whether
> pages in the VMA might be shared due to this mm_struct being the
> parent of another mm_struct (other way around). I guess checking
> old->anon_vma->root->num_children could maybe work...

Yeah you're completely right, this is entirely an oversight on my part.

I don't think we need to be looking at old->anon_vma->root though? For anon_vma->root != anon_vma to hold here we'd need to be a child, and we just asserted we're not, right? (could be an additional check though).

But we definitely need an old->anon_vma->num_children check.
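
Roughly something like the below in should_relocate_anon() I think (untested sketch - the open question being whether this read needs to happen under the anon_vma lock, see below):

        /*
         * If we have forked children, folios in this VMA may be shared with
         * them, in which case this anon_vma will have child anon_vmas
         * hanging off it - only proceed if it is purely self-parented.
         */
        if (old->anon_vma->num_children != 1)
                return false;
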
Holding the old VMA lock should prevent further forks, so we could just take the rmap lock, do the check, then rmap unlock here?

>
> > +
> > +        /* Otherwise, we're good to go! */
> > +        return true;
> > +}
> > +
> >  /*
> >   * Copy vrm->vma over to vrm->new_addr possibly adjusting size as part of the
> >   * process. Additionally handle an error occurring on moving of page tables,
> > @@ -1151,9 +1402,11 @@ static int copy_vma_and_data(struct vma_remap_struct *vrm,
> >          struct vm_area_struct *new_vma;
> >          int err = 0;
> >          PAGETABLE_MOVE(pmc, NULL, NULL, vrm->addr, vrm->new_addr, vrm->old_len);
> > +        bool relocate_anon = should_relocate_anon(vrm, &pmc);
> >
> > +again:
> >          new_vma = copy_vma(&vma, vrm->new_addr, vrm->new_len, new_pgoff,
> > -                        &pmc.need_rmap_locks);
> > +                        &pmc.need_rmap_locks, &relocate_anon);
> >          if (!new_vma) {
> >                  vrm_uncharge(vrm);
> >                  *new_vma_ptr = NULL;
> > @@ -1163,12 +1416,66 @@ static int copy_vma_and_data(struct vma_remap_struct *vrm,
> >          pmc.old = vma;
> >          pmc.new = new_vma;
> >
> > +        if (relocate_anon) {
> > +                /*
> > +                 * We have a new VMA to reassign folios to. We take a lock on
> > +                 * its anon_vma so reclaim doesn't fail to unmap mappings.
> > +                 *
> > +                 * We have acquired a VMA write lock by now (in vma_link()), so
> > +                 * we do not have to worry about racing faults.
> > +                 */
> > +                anon_vma_lock_write(new_vma->anon_vma);
> > +                pmc.relocate_locked = new_vma;
> > +
> > +                if (!relocate_anon_folios(&pmc, /* undo= */false)) {
> > +                        unsigned long start = new_vma->vm_start;
> > +                        unsigned long size = new_vma->vm_end - start;
> > +
> > +                        /* Undo if fails. */
> > +                        relocate_anon_folios(&pmc, /* undo= */true);
>
> This relocate_anon_folios() has to make sure to only operate on the
> subset of PTEs that we already edited successfully, right?

The assumption is that, since we always start from the beginning, the undo pass would hit the same failure again right at the end of the subset we already edited. This may or may not be a safe assumption :)

>
> > +                        vrm_stat_account(vrm, vrm->new_len);
> > +
> > +                        anon_vma_unlock_write(new_vma->anon_vma);
> > +                        pmc.relocate_locked = NULL;
> > +
> > +                        do_munmap(current->mm, start, size, NULL);
> > +                        relocate_anon = false;
> > +                        goto again;
> > +                }
> > +        }
> [...]
> > diff --git a/mm/vma.c b/mm/vma.c
> > index 5418eef3a852..09027448753f 100644
> > --- a/mm/vma.c
> > +++ b/mm/vma.c
> [...]
> > @@ -1821,6 +1834,14 @@ struct vm_area_struct *copy_vma(struct vm_area_struct **vmap,
> >                          new_vma->vm_ops->open(new_vma);
> >                  if (vma_link(mm, new_vma))
> >                          goto out_vma_link;
> > +                /*
> > +                 * If we're attempting to relocate anonymous VMAs, we
> > +                 * don't want to reuse an anon_vma as set by
> > +                 * vm_area_dup(), or copy anon_vma_chain or anything
> > +                 * like this.
> > +                 */
> > +                if (*relocate_anon && __anon_vma_prepare(new_vma))
> > +                        goto out_vma_link;
>
> Is this bailout really okay? We go to the same label as if vma_link()
> failed, but we're in a very different state: For example, the VMA has
> already been inserted into the VMA tree with vma_iter_store() as part
> of vma_link() (unless that changed in one of the prerequisite
> patches).

Yeah I think you're right, this is an oversight (hey - error paths - fun right??) - will correct.

>
> >                  *need_rmap_locks = false;
> >          }
> >          return new_vma;
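
On the undo question above - if relying on hitting the same failure again turns out to be too fragile, an alternative might be to have the forward pass record how far it got and bound the undo pass explicitly. Something like (rough sketch - relocate_anon_folios_undo() and pmc.relocate_end are hypothetical, the forward pass would set relocate_end to the first address it failed on):

                if (!relocate_anon_folios(&pmc, /* undo= */false)) {
                        unsigned long start = new_vma->vm_start;
                        unsigned long size = new_vma->vm_end - start;

                        /*
                         * Undo only the range the forward pass actually
                         * modified, i.e. [pmc.old_addr, pmc.relocate_end).
                         */
                        relocate_anon_folios_undo(&pmc, pmc.relocate_end);
                        ...
                }

Will have a think about which way round is least error-prone.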