The patch titled
     Subject: mm/rmap: handle device-exclusive entries correctly in try_to_migrate_one()
has been added to the -mm mm-unstable branch.  Its filename is
     mm-rmap-handle-device-exclusive-entries-correctly-in-try_to_migrate_one.patch

This patch will shortly appear at
     https://git.kernel.org/pub/scm/linux/kernel/git/akpm/25-new.git/tree/patches/mm-rmap-handle-device-exclusive-entries-correctly-in-try_to_migrate_one.patch

This patch will later appear in the mm-unstable branch at
    git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

Before you just go and hit "reply", please:
   a) Consider who else should be cc'ed
   b) Prefer to cc a suitable mailing list as well
   c) Ideally: find the original patch on the mailing list and do a
      reply-to-all to that, adding suitable additional cc's

*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***

The -mm tree is included into linux-next via the mm-everything
branch at git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
and is updated there every 2-3 working days

------------------------------------------------------
From: David Hildenbrand <david@xxxxxxxxxx>
Subject: mm/rmap: handle device-exclusive entries correctly in try_to_migrate_one()
Date: Mon, 10 Feb 2025 20:37:53 +0100

Ever since commit b756a3b5e7ea ("mm: device exclusive memory access") we
can return with a device-exclusive entry from page_vma_mapped_walk().

try_to_migrate_one() is not prepared for that, so teach it about these
PFN swap PTEs.  We already handle device-private entries by specializing
on the folio, so we can reshuffle that code to make it work on the PFN
swap PTEs instead.

Get rid of the folio_is_device_private() handling.  Note that we never
currently expect device-private folios with HWPoison flag set at that
point, so add a warning in case that ever changes and we can figure out
what the right thing to do is.

Note that we could currently only run into this case with
device-exclusive entries on THPs.  We still adjust the mapcount on
conversion to device-exclusive; this makes the rmap walk abort early for
small folios, because we'll always have !folio_mapped() with a single
device-exclusive entry.  We'll adjust the mapcount logic once all
page_vma_mapped_walk() users can properly handle device-exclusive
entries.

Further note that try_to_migrate() calls MMU notifiers and holds the
folio lock, so any device-exclusive users should be properly prepared
for a device-exclusive PTE to "vanish".
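As a quick illustration of the PFN handling described above, here is a
userspace-only C sketch.  It is not part of the patch; all names such as
mock_pte, mock_folio and subpage_index are invented for the example.  It
merely models how the subpage index can be computed uniformly once the PFN
is taken either from a present PTE or from a non-present PFN swap PTE such
as a device-exclusive entry:

/*
 * Illustrative sketch only -- not kernel code.  It mimics the control flow
 * the patch introduces in try_to_migrate_one(): take the PFN from a present
 * PTE directly, or from the PFN swap entry when the PTE is non-present
 * (e.g. a device-exclusive entry), then compute the subpage the same way
 * in both cases.  All types and helpers are mock-ups.
 */
#include <stdbool.h>
#include <stdio.h>

struct mock_pte {
	bool present;		/* stands in for pte_present() */
	unsigned long pfn;	/* stands in for pte_pfn() */
	unsigned long swp_pfn;	/* stands in for swp_offset_pfn() */
};

struct mock_folio {
	unsigned long first_pfn;	/* stands in for folio_pfn() */
};

/* Present and PFN swap PTEs are handled the same once the PFN is known. */
static unsigned long subpage_index(const struct mock_pte *pte,
				   const struct mock_folio *folio)
{
	unsigned long pfn = pte->present ? pte->pfn : pte->swp_pfn;

	return pfn - folio->first_pfn;	/* stands in for folio_page() */
}

int main(void)
{
	struct mock_folio folio = { .first_pfn = 0x1000 };
	struct mock_pte present = { .present = true, .pfn = 0x1003 };
	struct mock_pte excl = { .present = false, .swp_pfn = 0x100a };

	printf("present PTE -> subpage %lu\n",
	       subpage_index(&present, &folio));
	printf("device-exclusive PTE -> subpage %lu\n",
	       subpage_index(&excl, &folio));
	return 0;
}

In the real code, pte_pfn()/swp_offset_pfn() and folio_page() take the
place of the mock fields above; the patch below additionally tracks whether
the original mapping was writable so the right migration entry can be
constructed for both present and non-present PTEs.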
Link: https://lkml.kernel.org/r/20250210193801.781278-12-david@xxxxxxxxxx
Fixes: b756a3b5e7ea ("mm: device exclusive memory access")
Signed-off-by: David Hildenbrand <david@xxxxxxxxxx>
Cc: Alex Shi <alexs@xxxxxxxxxx>
Cc: Alistair Popple <apopple@xxxxxxxxxx>
Cc: Danilo Krummrich <dakr@xxxxxxxxxx>
Cc: Dave Airlie <airlied@xxxxxxxxx>
Cc: Jann Horn <jannh@xxxxxxxxxx>
Cc: Jason Gunthorpe <jgg@xxxxxxxxxx>
Cc: Jerome Glisse <jglisse@xxxxxxxxxx>
Cc: John Hubbard <jhubbard@xxxxxxxxxx>
Cc: Jonathan Corbet <corbet@xxxxxxx>
Cc: Karol Herbst <kherbst@xxxxxxxxxx>
Cc: Liam Howlett <liam.howlett@xxxxxxxxxx>
Cc: Lorenzo Stoakes <lorenzo.stoakes@xxxxxxxxxx>
Cc: Lyude <lyude@xxxxxxxxxx>
Cc: "Masami Hiramatsu (Google)" <mhiramat@xxxxxxxxxx>
Cc: Oleg Nesterov <oleg@xxxxxxxxxx>
Cc: Pasha Tatashin <pasha.tatashin@xxxxxxxxxx>
Cc: Peter Xu <peterx@xxxxxxxxxx>
Cc: Peter Zijlstra (Intel) <peterz@xxxxxxxxxxxxx>
Cc: SeongJae Park <sj@xxxxxxxxxx>
Cc: Simona Vetter <simona.vetter@xxxxxxxx>
Cc: Vlastimil Babka <vbabka@xxxxxxx>
Cc: Yanteng Si <si.yanteng@xxxxxxxxx>
Cc: Barry Song <v-songbaohua@xxxxxxxx>
Signed-off-by: Andrew Morton <akpm@xxxxxxxxxxxxxxxxxxxx>
---

 mm/rmap.c |  124 +++++++++++++++++++++-------------------
 1 file changed, 51 insertions(+), 73 deletions(-)

--- a/mm/rmap.c~mm-rmap-handle-device-exclusive-entries-correctly-in-try_to_migrate_one
+++ a/mm/rmap.c
@@ -2039,9 +2039,9 @@ static bool try_to_migrate_one(struct fo
 {
 	struct mm_struct *mm = vma->vm_mm;
 	DEFINE_FOLIO_VMA_WALK(pvmw, folio, vma, address, 0);
+	bool anon_exclusive, writable, ret = true;
 	pte_t pteval;
 	struct page *subpage;
-	bool anon_exclusive, ret = true;
 	struct mmu_notifier_range range;
 	enum ttu_flags flags = (enum ttu_flags)(long)arg;
 	unsigned long pfn;
@@ -2108,24 +2108,19 @@ static bool try_to_migrate_one(struct fo
 		/* Unexpected PMD-mapped THP? */
 		VM_BUG_ON_FOLIO(!pvmw.pte, folio);
 
-		pfn = pte_pfn(ptep_get(pvmw.pte));
-
-		if (folio_is_zone_device(folio)) {
-			/*
-			 * Our PTE is a non-present device exclusive entry and
-			 * calculating the subpage as for the common case would
-			 * result in an invalid pointer.
-			 *
-			 * Since only PAGE_SIZE pages can currently be
-			 * migrated, just set it to page. This will need to be
-			 * changed when hugepage migrations to device private
-			 * memory are supported.
-			 */
-			VM_BUG_ON_FOLIO(folio_nr_pages(folio) > 1, folio);
-			subpage = &folio->page;
+		/*
+		 * Handle PFN swap PTEs, such as device-exclusive ones, that
+		 * actually map pages.
+		 */
+		pteval = ptep_get(pvmw.pte);
+		if (likely(pte_present(pteval))) {
+			pfn = pte_pfn(pteval);
 		} else {
-			subpage = folio_page(folio, pfn - folio_pfn(folio));
+			pfn = swp_offset_pfn(pte_to_swp_entry(pteval));
+			VM_WARN_ON_FOLIO(folio_test_hugetlb(folio), folio);
 		}
+
+		subpage = folio_page(folio, pfn - folio_pfn(folio));
 		address = pvmw.address;
 		anon_exclusive = folio_test_anon(folio) &&
 				 PageAnonExclusive(subpage);
@@ -2181,7 +2176,10 @@ static bool try_to_migrate_one(struct fo
 			}
 			/* Nuke the hugetlb page table entry */
 			pteval = huge_ptep_clear_flush(vma, address, pvmw.pte);
-		} else {
+			if (pte_dirty(pteval))
+				folio_mark_dirty(folio);
+			writable = pte_write(pteval);
+		} else if (likely(pte_present(pteval))) {
 			flush_cache_page(vma, address, pfn);
 			/* Nuke the page table entry. */
 			if (should_defer_flush(mm, flags)) {
@@ -2199,54 +2197,23 @@ static bool try_to_migrate_one(struct fo
 			} else {
 				pteval = ptep_clear_flush(vma, address, pvmw.pte);
 			}
+			if (pte_dirty(pteval))
+				folio_mark_dirty(folio);
+			writable = pte_write(pteval);
+		} else {
+			pte_clear(mm, address, pvmw.pte);
+			writable = is_writable_device_private_entry(pte_to_swp_entry(pteval));
 		}
 
-		/* Set the dirty flag on the folio now the pte is gone. */
-		if (pte_dirty(pteval))
-			folio_mark_dirty(folio);
+		VM_WARN_ON_FOLIO(writable && folio_test_anon(folio) &&
+				 !anon_exclusive, folio);
 
 		/* Update high watermark before we lower rss */
 		update_hiwater_rss(mm);
 
-		if (folio_is_device_private(folio)) {
-			unsigned long pfn = folio_pfn(folio);
-			swp_entry_t entry;
-			pte_t swp_pte;
-
-			if (anon_exclusive)
-				WARN_ON_ONCE(folio_try_share_anon_rmap_pte(folio,
-									   subpage));
-
-			/*
-			 * Store the pfn of the page in a special migration
-			 * pte. do_swap_page() will wait until the migration
-			 * pte is removed and then restart fault handling.
-			 */
-			entry = pte_to_swp_entry(pteval);
-			if (is_writable_device_private_entry(entry))
-				entry = make_writable_migration_entry(pfn);
-			else if (anon_exclusive)
-				entry = make_readable_exclusive_migration_entry(pfn);
-			else
-				entry = make_readable_migration_entry(pfn);
-			swp_pte = swp_entry_to_pte(entry);
+		if (PageHWPoison(subpage)) {
+			VM_WARN_ON_FOLIO(folio_is_device_private(folio), folio);
 
-			/*
-			 * pteval maps a zone device page and is therefore
-			 * a swap pte.
-			 */
-			if (pte_swp_soft_dirty(pteval))
-				swp_pte = pte_swp_mksoft_dirty(swp_pte);
-			if (pte_swp_uffd_wp(pteval))
-				swp_pte = pte_swp_mkuffd_wp(swp_pte);
-			set_pte_at(mm, pvmw.address, pvmw.pte, swp_pte);
-			trace_set_migration_pte(pvmw.address, pte_val(swp_pte),
-						folio_order(folio));
-			/*
-			 * No need to invalidate here it will synchronize on
-			 * against the special swap migration pte.
-			 */
-		} else if (PageHWPoison(subpage)) {
 			pteval = swp_entry_to_pte(make_hwpoison_entry(subpage));
 			if (folio_test_hugetlb(folio)) {
 				hugetlb_count_sub(folio_nr_pages(folio), mm);
@@ -2256,8 +2223,8 @@ static bool try_to_migrate_one(struct fo
 				dec_mm_counter(mm, mm_counter(folio));
 				set_pte_at(mm, address, pvmw.pte, pteval);
 			}
-
-		} else if (pte_unused(pteval) && !userfaultfd_armed(vma)) {
+		} else if (likely(pte_present(pteval)) && pte_unused(pteval) &&
+			   !userfaultfd_armed(vma)) {
 			/*
 			 * The guest indicated that the page content is of no
 			 * interest anymore. Simply discard the pte, vmscan
@@ -2273,6 +2240,11 @@ static bool try_to_migrate_one(struct fo
 			swp_entry_t entry;
 			pte_t swp_pte;
 
+			/*
+			 * arch_unmap_one() is expected to be a NOP on
+			 * architectures where we could have PFN swap PTEs,
+			 * so we'll not check/care.
+			 */
 			if (arch_unmap_one(mm, vma, address, pteval) < 0) {
 				if (folio_test_hugetlb(folio))
 					set_huge_pte_at(mm, address, pvmw.pte,
@@ -2283,8 +2255,6 @@ static bool try_to_migrate_one(struct fo
 				page_vma_mapped_walk_done(&pvmw);
 				break;
 			}
-			VM_BUG_ON_PAGE(pte_write(pteval) && folio_test_anon(folio) &&
-				       !anon_exclusive, subpage);
 
 			/* See folio_try_share_anon_rmap_pte(): clear PTE first. */
 			if (folio_test_hugetlb(folio)) {
@@ -2309,7 +2279,7 @@ static bool try_to_migrate_one(struct fo
 			 * pte. do_swap_page() will wait until the migration
 			 * pte is removed and then restart fault handling.
 			 */
-			if (pte_write(pteval))
+			if (writable)
 				entry = make_writable_migration_entry(
 							page_to_pfn(subpage));
 			else if (anon_exclusive)
@@ -2318,15 +2288,23 @@ static bool try_to_migrate_one(struct fo
 			else
 				entry = make_readable_migration_entry(
 							page_to_pfn(subpage));
-			if (pte_young(pteval))
-				entry = make_migration_entry_young(entry);
-			if (pte_dirty(pteval))
-				entry = make_migration_entry_dirty(entry);
-			swp_pte = swp_entry_to_pte(entry);
-			if (pte_soft_dirty(pteval))
-				swp_pte = pte_swp_mksoft_dirty(swp_pte);
-			if (pte_uffd_wp(pteval))
-				swp_pte = pte_swp_mkuffd_wp(swp_pte);
+			if (likely(pte_present(pteval))) {
+				if (pte_young(pteval))
+					entry = make_migration_entry_young(entry);
+				if (pte_dirty(pteval))
+					entry = make_migration_entry_dirty(entry);
+				swp_pte = swp_entry_to_pte(entry);
+				if (pte_soft_dirty(pteval))
+					swp_pte = pte_swp_mksoft_dirty(swp_pte);
+				if (pte_uffd_wp(pteval))
+					swp_pte = pte_swp_mkuffd_wp(swp_pte);
+			} else {
+				swp_pte = swp_entry_to_pte(entry);
+				if (pte_swp_soft_dirty(pteval))
+					swp_pte = pte_swp_mksoft_dirty(swp_pte);
+				if (pte_swp_uffd_wp(pteval))
+					swp_pte = pte_swp_mkuffd_wp(swp_pte);
+			}
 			if (folio_test_hugetlb(folio))
 				set_huge_pte_at(mm, address, pvmw.pte, swp_pte,
 						hsz);
_

Patches currently in -mm which might be from david@xxxxxxxxxx are

mm-gup-reject-foll_split_pmd-with-hugetlb-vmas.patch
mm-rmap-reject-hugetlb-folios-in-folio_make_device_exclusive.patch
mm-rmap-convert-make_device_exclusive_range-to-make_device_exclusive.patch
mm-rmap-implement-make_device_exclusive-using-folio_walk-instead-of-rmap-walk.patch
mm-memory-detect-writability-in-restore_exclusive_pte-through-can_change_pte_writable.patch
mm-use-single-swp_device_exclusive-entry-type.patch
mm-page_vma_mapped-device-exclusive-entries-are-not-migration-entries.patch
kernel-events-uprobes-handle-device-exclusive-entries-correctly-in-__replace_page.patch
mm-ksm-handle-device-exclusive-entries-correctly-in-write_protect_page.patch
mm-rmap-handle-device-exclusive-entries-correctly-in-try_to_unmap_one.patch
mm-rmap-handle-device-exclusive-entries-correctly-in-try_to_migrate_one.patch
mm-rmap-handle-device-exclusive-entries-correctly-in-page_vma_mkclean_one.patch
mm-page_idle-handle-device-exclusive-entries-correctly-in-page_idle_clear_pte_refs_one.patch
mm-damon-handle-device-exclusive-entries-correctly-in-damon_folio_young_one.patch
mm-damon-handle-device-exclusive-entries-correctly-in-damon_folio_mkold_one.patch
mm-rmap-keep-mapcount-untouched-for-device-exclusive-entries.patch
mm-rmap-avoid-ebusy-from-make_device_exclusive.patch