Re: [PATCH RFC] mm: Fix kernel BUG when userfaultfd_move encounters swapcache

On Sun, Feb 23, 2025 at 10:31:37AM +1300, Barry Song wrote:
> On Fri, Feb 21, 2025 at 2:49 PM Peter Xu <peterx@xxxxxxxxxx> wrote:
> >
> > On Fri, Feb 21, 2025 at 01:07:24PM +1300, Barry Song wrote:
> > > On Fri, Feb 21, 2025 at 12:32 PM Peter Xu <peterx@xxxxxxxxxx> wrote:
> > > >
> > > > On Thu, Feb 20, 2025 at 10:21:01PM +1300, Barry Song wrote:
> > > > > 2. src_anon_vma and its lock – swapcache doesn't require it (the folio is not mapped)
> > > >
> > > > Could you help explain what guarantees that the rmap walk cannot happen
> > > > on a swapcache page?
> > > >
> > > > I'm not familiar with this path, though at least I see DAMON can start an
> > > > rmap walk on a PageAnon folio with almost no locking..  some explanation
> > > > would be appreciated.
> > >
> > > I am looking at the following check in folio_referenced(), which is what
> > > the anon_vma lock was originally intended to serialize against.
> > >
> > >         if (!pra.mapcount)
> > >                 return 0;
> > >
> > > I assume all other rmap walks should do the same?
> >
> > Yes normally there'll be a folio_mapcount() check, however..
> >
> > >
> > > int folio_referenced(struct folio *folio, int is_locked,
> > >                      struct mem_cgroup *memcg, unsigned long *vm_flags)
> > > {
> > >
> > >         bool we_locked = false;
> > >         struct folio_referenced_arg pra = {
> > >                 .mapcount = folio_mapcount(folio),
> > >                 .memcg = memcg,
> > >         };
> > >
> > >         struct rmap_walk_control rwc = {
> > >                 .rmap_one = folio_referenced_one,
> > >                 .arg = (void *)&pra,
> > >                 .anon_lock = folio_lock_anon_vma_read,
> > >                 .try_lock = true,
> > >                 .invalid_vma = invalid_folio_referenced_vma,
> > >         };
> > >
> > >         *vm_flags = 0;
> > >         if (!pra.mapcount)
> > >                 return 0;
> > >         ...
> > > }
> > >
> > > By the way, since the folio has been under reclamation in this case and
> > > isn't on the LRU, this should also prevent the rmap walk, right?
> >
> > .. I'm not sure whether it's always working.
> >
> > The thing is, anon doesn't even require the folio lock to be held during (1)
> > checking the mapcount and (2) doing the rmap walk, in all similar cases as
> > above.  I see nothing that stops a concurrent thread from zapping that last
> > mapping:
> >
> >                thread 1                         thread 2
> >                --------                         --------
> >         [whatever scanner]
> >            check folio_mapcount(), non-zero
> >                                                 zap the last map.. then mapcount==0
> >            rmap_walk()
> >
> > Not sure if I missed something.
> >
> > The other thing is, IIUC, a swapcache page also has a chance to be faulted
> > in, but only on a read, not a write.  I actually had a feeling that your
> > reproducer triggered that exact path, causing a read swap-in, reusing the
> > swapcache page, and hitting the sanity check there somehow (even as mentioned
> > in the other reply, I don't yet know why the 1st check didn't seem to
> > work.. as we do check folio->index twice..).
> >
> > That said, I'm not sure the above concern applies in this specific case,
> > as UFFDIO_MOVE is pretty special: we check the exclusive bit in the swp
> > entry first, so we know it's definitely not mapped elsewhere, and while we
> > hold the pgtable lock maybe it can't get mapped back.. it is still tricky;
> > at least we do some dances around releasing and retaking locks.
> >
> > We could either justify that that's safe, or it may still be OK and simpler
> > to take the anon_vma write lock, making sure nobody will be able to read
> > folio->index while it's prone to an update.
> 
> What prompted me to do the former is that folio_get_anon_vma() returns
> NULL for an unmapped folio. As for the latter, we need to carefully evaluate
> whether the change below is safe.
> 
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -505,7 +505,7 @@ struct anon_vma *folio_get_anon_vma(const struct folio *folio)
>         anon_mapping = (unsigned long)READ_ONCE(folio->mapping);
>         if ((anon_mapping & PAGE_MAPPING_FLAGS) != PAGE_MAPPING_ANON)
>                 goto out;
> 
> -       if (!folio_mapped(folio))
> +       if (!folio_mapped(folio) && !folio_test_swapcache(folio))
>                 goto out;
> 
>         anon_vma = (struct anon_vma *) (anon_mapping - PAGE_MAPPING_ANON);
> @@ -521,7 +521,7 @@ struct anon_vma *folio_get_anon_vma(const struct folio *folio)
>          * SLAB_TYPESAFE_BY_RCU guarantees that - so the atomic_inc_not_zero()
>          * above cannot corrupt).
>          */

[1]

> 
> -       if (!folio_mapped(folio)) {
> +       if (!folio_mapped(folio) && !folio_test_swapcache(folio)) {
>                 rcu_read_unlock();
>                 put_anon_vma(anon_vma);
>                 return NULL;

Hmm, this made me go back and read again how we manage the anon_vma lifespan,
and I just noticed this may not work.

See the comment right above [1], here's a full version:

	/*
	 * If this folio is still mapped, then its anon_vma cannot have been
	 * freed.  But if it has been unmapped, we have no security against the
	 * anon_vma structure being freed and reused (for another anon_vma:
	 * SLAB_TYPESAFE_BY_RCU guarantees that - so the atomic_inc_not_zero()
	 * above cannot corrupt).
	 */

So AFAIU that means we rely pretty much entirely on the folio_mapped() check
to make sure the anon_vma we fetched from folio->mapping is valid at all, not
to mention for the rmap walk afterwards.

Then the above diff in folio_get_anon_vma() should be problematic: when
"folio_mapped()==false && folio_test_swapcache()==true", the change will
start to return an anon_vma pointer even though the anon_vma could have been
freed and reused by other VMAs.
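
To make that concrete, here is a rough annotation of the relevant part of
folio_get_anon_vma() (just an illustration of the hazard, not a patch):

	anon_mapping = (unsigned long)READ_ONCE(folio->mapping);
	anon_vma = (struct anon_vma *)(anon_mapping - PAGE_MAPPING_ANON);
	/*
	 * SLAB_TYPESAFE_BY_RCU only guarantees that this memory stays an
	 * anon_vma object under rcu_read_lock(), so the refcount bump
	 * below won't corrupt random memory...
	 */
	if (!atomic_inc_not_zero(&anon_vma->refcount)) {
		anon_vma = NULL;
		goto out;
	}
	/*
	 * ...but once the folio is fully unmapped, nothing pins *this*
	 * anon_vma: the object may already have been freed and recycled
	 * for a different VMA, so the pointer is useless for a later rmap
	 * walk.  Sitting in the swapcache doesn't change that, which is
	 * why the folio_mapped() check can't simply be relaxed as above.
	 */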

> 
> 
> The above change, combined with the change below, has also resolved the mTHP
> -EBUSY issue.
> 
> diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
> index e5718835a964..1ef991b5c225 100644
> --- a/mm/userfaultfd.c
> +++ b/mm/userfaultfd.c
> @@ -1333,6 +1333,7 @@ static int move_pages_pte(struct mm_struct *mm, pmd_t *dst_pmd, pmd_t *src_pmd,
>                 pte_unmap(&orig_src_pte);
>                 pte_unmap(&orig_dst_pte);
>                 src_pte = dst_pte = NULL;
> +               folio_wait_writeback(src_folio);
>                 err = split_folio(src_folio);
> 
>                 if (err)
>                         goto out;
> @@ -1343,7 +1344,7 @@ static int move_pages_pte(struct mm_struct *mm, pmd_t *dst_pmd, pmd_t *src_pmd,
>                 goto retry;
>         }
> 
> -       if (!src_anon_vma && pte_present(orig_src_pte)) {
> +       if (!src_anon_vma) {
>                 /*
>                  * folio_referenced walks the anon_vma chain
>                  * without the folio lock. Serialize against it with
> 
> 
> split_folio() returns -EBUSY if the folio is under writeback or if
> folio_get_anon_vma() returns NULL.
> 
> I have no issues with the latter, provided the change in folio_get_anon_vma()
> is safe, as it also resolves the mTHP -EBUSY issue.
> 
> We need to carefully consider the five places where folio_get_anon_vma() is
> called, as this patch will also be backported to stable.
> 
>    1   2618  mm/huge_memory.c <<move_pages_huge_pmd>>
>              src_anon_vma = folio_get_anon_vma(src_folio);
> 
>    2   3765  mm/huge_memory.c <<__folio_split>>
>              anon_vma = folio_get_anon_vma(folio);
> 
>    3   1280  mm/migrate.c <<migrate_folio_unmap>>
>              anon_vma = folio_get_anon_vma(src);
> 
>    4   1485  mm/migrate.c <<unmap_and_move_huge_page>>
>              anon_vma = folio_get_anon_vma(src);
> 
>    5   1354  mm/userfaultfd.c <<move_pages_pte>>
>              src_anon_vma = folio_get_anon_vma(src_folio);

If my above understanding is correct, we may indeed need alternative plans.
Not sure whether others have thoughts, but two things I am thinking out
loud..

  - Justify that an rmap walk on the swapcache folio is not possible: I think
    if folio_mapped() is required for any anon rmap walk (which I didn't
    notice previously..), and we know this is an exclusive swap entry and
    that stays true (so nobody will be able to race and make the swapcache
    folio mapped during the whole process), maybe we are able to justify that
    no rmap walk can happen at all, because it would fail at the
    folio_mapped() check.  Then we update folio->mapping & folio->index
    directly without anon_vma locking.  A rich comment would be helpful in
    this case.. (a rough sketch follows this list)

  - I wonder if it's possible to free the swap cache if the folio has already
    been written back to the backend storage and is clean.  Then moving the
    swp entry alone looks safe, with no swapcache present.
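
To illustrate the first idea only (a rough, untested sketch using the variable
names from move_pages_pte; the condition and the locking assumptions are mine,
not a reviewed patch):

	/*
	 * Sketch: src_folio is locked, the src pgtable lock is held, and
	 * the swp entry is exclusive, so the folio cannot become mapped
	 * under us.  If every anon rmap walk indeed bails out on
	 * !folio_mapped(), the re-parenting below would not need the
	 * anon_vma write lock.
	 */
	if (!folio_mapped(src_folio) && folio_test_swapcache(src_folio)) {
		folio_move_anon_rmap(src_folio, dst_vma);
		src_folio->index = linear_page_index(dst_vma, dst_addr);
	}

Whether that reasoning covers every rmap walker is exactly what the rich
comment would have to justify.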

Thanks,

-- 
Peter Xu




