On Fri, Feb 21, 2025 at 2:49 PM Peter Xu <peterx@xxxxxxxxxx> wrote:
>
> On Fri, Feb 21, 2025 at 01:07:24PM +1300, Barry Song wrote:
> > On Fri, Feb 21, 2025 at 12:32 PM Peter Xu <peterx@xxxxxxxxxx> wrote:
> > >
> > > On Thu, Feb 20, 2025 at 10:21:01PM +1300, Barry Song wrote:
> > > > 2. src_anon_vma and its lock – swapcache doesn’t require it (folio is not mapped)
> > >
> > > Could you help explain what guarantees the rmap walk not happen on a
> > > swapcache page?
> > >
> > > I'm not familiar with this path, though at least I see damon can start a
> > > rmap walk on PageAnon almost with no locking.. some explanations would be
> > > appreciated.
> >
> > I am observing the following in folio_referenced(), which the anon_vma lock
> > was originally intended to protect.
> >
> >         if (!pra.mapcount)
> >                 return 0;
> >
> > I assume all other rmap walks should do the same?
>
> Yes normally there'll be a folio_mapcount() check, however..
>
> > int folio_referenced(struct folio *folio, int is_locked,
> >                      struct mem_cgroup *memcg, unsigned long *vm_flags)
> > {
> >         bool we_locked = false;
> >         struct folio_referenced_arg pra = {
> >                 .mapcount = folio_mapcount(folio),
> >                 .memcg = memcg,
> >         };
> >
> >         struct rmap_walk_control rwc = {
> >                 .rmap_one = folio_referenced_one,
> >                 .arg = (void *)&pra,
> >                 .anon_lock = folio_lock_anon_vma_read,
> >                 .try_lock = true,
> >                 .invalid_vma = invalid_folio_referenced_vma,
> >         };
> >
> >         *vm_flags = 0;
> >         if (!pra.mapcount)
> >                 return 0;
> >         ...
> > }
> >
> > By the way, since the folio has been under reclamation in this case and
> > isn't in the lru, this should also prevent the rmap walk, right?
>
> .. I'm not sure whether it's always working.
>
> The thing is anon doesn't even require folio lock held during (1) checking
> mapcount and (2) doing the rmap walk, in all similar cases as above. I see
> nothing blocks it from a concurrent thread zapping that last mapcount:
>
>         thread 1                    thread 2
>         --------                    --------
>                                     [whatever scanner]
>                                     check folio_mapcount(), non-zero
>         zap the last map..          then mapcount==0
>                                     rmap_walk()
>
> Not sure if I missed something.
>
> The other thing is IIUC swapcache page can also have chance to be faulted
> in but only if a read not write. I actually had a feeling that your
> reproducer triggered that exact path, causing a read swap in, reusing the
> swapcache page, and hit the sanity check there somehow (even as mentioned
> in the other reply, I don't yet know why the 1st check didn't seem to
> work.. as we do check folio->index twice..).
>
> Said that, I'm not sure if above concern will happen in this specific case,
> as UFFDIO_MOVE is pretty special, that we check exclusive bit first in swp
> entry so we know it's definitely not mapped elsewhere, meanwhile if we hold
> pgtable lock so maybe it can't get mapped back.. it is just still tricky,
> at least we do some dances all over releasing and retaking locks.
>
> We could either justify that's safe, or maybe still ok and simpler if we
> could take anon_vma write lock, making sure nobody will be able to read the
> folio->index when it's prone to an update.

What prompted me to go with the former is that folio_get_anon_vma() returns
NULL for an unmapped folio. As for the latter, we need to carefully evaluate
whether the change below is safe.
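To make that concrete, here is a rough sketch of why folio_get_anon_vma()
returns NULL for an unmapped folio. This is simplified pseudo-kernel code
with the RCU and refcount handling stripped out, not the literal function
body; the point is only that a swapcache folio whose last mapping has been
zapped still carries its anon_vma in folio->mapping, yet folio_mapped() is
false, so the current code gives up on it:

struct anon_vma *folio_get_anon_vma_sketch(const struct folio *folio)
{
        unsigned long anon_mapping = (unsigned long)READ_ONCE(folio->mapping);

        /* only anonymous folios keep an anon_vma pointer in folio->mapping */
        if ((anon_mapping & PAGE_MAPPING_FLAGS) != PAGE_MAPPING_ANON)
                return NULL;

        /*
         * The test the diff below relaxes: for a fully unmapped swapcache
         * folio this is false, so we return NULL even though the anon_vma
         * pointer above is still valid.
         */
        if (!folio_mapped(folio))
                return NULL;

        return (struct anon_vma *)(anon_mapping - PAGE_MAPPING_ANON);
}

The change itself: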
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -505,7 +505,7 @@ struct anon_vma *folio_get_anon_vma(const struct folio *folio)
         anon_mapping = (unsigned long)READ_ONCE(folio->mapping);
         if ((anon_mapping & PAGE_MAPPING_FLAGS) != PAGE_MAPPING_ANON)
                 goto out;
-        if (!folio_mapped(folio))
+        if (!folio_mapped(folio) && !folio_test_swapcache(folio))
                 goto out;
 
         anon_vma = (struct anon_vma *) (anon_mapping - PAGE_MAPPING_ANON);
@@ -521,7 +521,7 @@ struct anon_vma *folio_get_anon_vma(const struct folio *folio)
          * SLAB_TYPESAFE_BY_RCU guarantees that - so the atomic_inc_not_zero()
          * above cannot corrupt).
          */
-        if (!folio_mapped(folio)) {
+        if (!folio_mapped(folio) && !folio_test_swapcache(folio)) {
                 rcu_read_unlock();
                 put_anon_vma(anon_vma);
                 return NULL;

The above change, combined with the change below, has also resolved the
mTHP -EBUSY issue.

diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
index e5718835a964..1ef991b5c225 100644
--- a/mm/userfaultfd.c
+++ b/mm/userfaultfd.c
@@ -1333,6 +1333,7 @@ static int move_pages_pte(struct mm_struct *mm, pmd_t *dst_pmd, pmd_t *src_pmd,
                         pte_unmap(&orig_src_pte);
                         pte_unmap(&orig_dst_pte);
                         src_pte = dst_pte = NULL;
+                        folio_wait_writeback(src_folio);
                         err = split_folio(src_folio);
                         if (err)
                                 goto out;
@@ -1343,7 +1344,7 @@ static int move_pages_pte(struct mm_struct *mm, pmd_t *dst_pmd, pmd_t *src_pmd,
                         goto retry;
                 }
 
-                if (!src_anon_vma && pte_present(orig_src_pte)) {
+                if (!src_anon_vma) {
                         /*
                          * folio_referenced walks the anon_vma chain
                          * without the folio lock. Serialize against it with

split_folio() returns -EBUSY if the folio is under writeback or if
folio_get_anon_vma() returns NULL. I have no issues with the latter,
provided the change in folio_get_anon_vma() is safe, as it also resolves
the mTHP -EBUSY issue. We need to carefully consider the five places where
folio_get_anon_vma() is called, as this patch will also be backported to
stable.

1  2618  mm/huge_memory.c  <<move_pages_huge_pmd>>       src_anon_vma = folio_get_anon_vma(src_folio);
2  3765  mm/huge_memory.c  <<__folio_split>>             anon_vma = folio_get_anon_vma(folio);
3  1280  mm/migrate.c      <<migrate_folio_unmap>>       anon_vma = folio_get_anon_vma(src);
4  1485  mm/migrate.c      <<unmap_and_move_huge_page>>  anon_vma = folio_get_anon_vma(src);
5  1354  mm/userfaultfd.c  <<move_pages_pte>>            src_anon_vma = folio_get_anon_vma(src_folio);

>
> Thanks,
>
> --
> Peter Xu
>

Thanks
barry