On 29.03.22 18:04, David Hildenbrand wrote: > This series is the result of the discussion on the previous approach [2]. > More information on the general COW issues can be found there. It is based > on latest linus/master (post v5.17, with relevant core-MM changes for > v5.18-rc1). > > v3 is located at: > https://github.com/davidhildenbrand/linux/tree/cow_fixes_part_2_v3 > > > This series fixes memory corruptions when a GUP pin (FOLL_PIN) was taken > on an anonymous page and COW logic fails to detect exclusivity of the page > to then replacing the anonymous page by a copy in the page table: The > GUP pin lost synchronicity with the pages mapped into the page tables. > > This issue, including other related COW issues, has been summarized in [3] > under 3): > " > 3. Intra Process Memory Corruptions due to Wrong COW (FOLL_PIN) > > page_maybe_dma_pinned() is used to check if a page may be pinned for > DMA (using FOLL_PIN instead of FOLL_GET). While false positives are > tolerable, false negatives are problematic: pages that are pinned for > DMA must not be added to the swapcache. If it happens, the (now pinned) > page could be faulted back from the swapcache into page tables > read-only. Future write-access would detect the pinning and COW the > page, losing synchronicity. For the interested reader, this is nicely > documented in feb889fb40fa ("mm: don't put pinned pages into the swap > cache"). > > Peter reports [8] that page_maybe_dma_pinned() as used is racy in some > cases and can result in a violation of the documented semantics: > giving false negatives because of the race. > > There are cases where we call it without properly taking a per-process > sequence lock, turning the usage of page_maybe_dma_pinned() racy. While > one case (clear_refs SOFTDIRTY tracking, see below) seems to be easy to > handle, there is especially one rmap case (shrink_page_list) that's hard > to fix: in the rmap world, we're not limited to a single process. > > The shrink_page_list() issue is really subtle. If we race with > someone pinning a page, we can trigger the same issue as in the FOLL_GET > case. See the detail section at the end of this mail on a discussion how > bad this can bite us with VFIO or other FOLL_PIN user. > > It's harder to reproduce, but I managed to modify the O_DIRECT > reproducer to use io_uring fixed buffers [15] instead, which ends up > using FOLL_PIN | FOLL_WRITE | FOLL_LONGTERM to pin buffer pages and can > similarly trigger a loss of synchronicity and consequently a memory > corruption. > > Again, the root issue is that a write-fault on a page that has > additional references results in a COW and thereby a loss of > synchronicity and consequently a memory corruption if two parties > believe they are referencing the same page. > " > > This series makes GUP pins (R/O and R/W) on anonymous pages fully reliable, > especially also taking care of concurrent pinning via GUP-fast, > for example, also fully fixing an issue reported regarding NUMA > balancing [4] recently. While doing that, it further reduces "unnecessary > COWs", especially when we don't fork()/KSM and don't swapout, and fixes the > COW security for hugetlb for FOLL_PIN. > > In summary, we track via a pageflag (PG_anon_exclusive) whether a mapped > anonymous page is exclusive. Exclusive anonymous pages that are mapped > R/O can directly be mapped R/W by the COW logic in the write fault handler. > Exclusive anonymous pages that want to be shared (fork(), KSM) first have > to be marked shared -- which will fail if there are GUP pins on the page. > GUP is only allowed to take a pin on anonymous pages that are exclusive. > The PT lock is the primary mechanism to synchronize modifications of > PG_anon_exclusive. We synchronize against GUP-fast either via the > src_mm->write_protect_seq (during fork()) or via clear/invalidate+flush of > the relevant page table entry. > > Special care has to be taken about swap, migration, and THPs (whereby a > PMD-mapping can be converted to a PTE mapping and we have to track > information for subpages). Besides these, we let the rmap code handle most > magic. For reliable R/O pins of anonymous pages, we need FAULT_FLAG_UNSHARE > logic as part of our previous approach [2], however, it's now 100% mapcount > free and I further simplified it a bit. > > #1 is a fix > #3-#10 are mostly rmap preparations for PG_anon_exclusive handling > #11 introduces PG_anon_exclusive > #12 uses PG_anon_exclusive and make R/W pins of anonymous pages > reliable > #13 is a preparation for reliable R/O pins > #14 and #15 is reused/modified GUP-triggered unsharing for R/O GUP pins > make R/O pins of anonymous pages reliable > #16 adds sanity check when (un)pinning anonymous pages > > > [1] https://lkml.kernel.org/r/20220131162940.210846-1-david@xxxxxxxxxx > [2] https://lkml.kernel.org/r/20211217113049.23850-1-david@xxxxxxxxxx > [3] https://lore.kernel.org/r/3ae33b08-d9ef-f846-56fb-645e3b9b4c66@xxxxxxxxxx > [4] https://bugzilla.kernel.org/show_bug.cgi?id=215616 > > > v2 -> v3: > * Note 1: Left the terminology "unshare" in place for now instead of > switching to "make anon exclusive". > * Note 2: We might have to tackle undoing effects of arch_unmap_one() on > sparc, to free some tag memory immediately instead of when tearing down > the vma/mm; looks like this needs more care either way, so I'll ignore it > for now. > * Rebased on top of core MM changes for v5.18-rc1 (most conflicts were due > to folio and ZONE_DEVICE migration rework). No severe changes were > necessary -- mostly folio conversion and code movement. > * Retested on aarch64, ppc64, s390x and x86_64 > * "mm/rmap: convert RMAP flags to a proper distinct rmap_t type" > -> Missed to convert one instance in restore_exclusive_pte() > * "mm/rmap: pass rmap flags to hugepage_add_anon_rmap()" > -> Use "!!(flags & RMAP_EXCLUSIVE)" to avoid sparse warnings > * "mm/huge_memory: remove outdated VM_WARN_ON_ONCE_PAGE from unmap_page()" > -> Added, as we can trigger that now more frequently > * "mm: remember exclusively mapped anonymous pages with PG_anon_exclusive" > -> Use subpage in VM_BUG_ON_PAGE() in try_to_migrate_one() > -> Move comment from folio_migrate_mapping() to folio_migrate_flags() > regarding PG_anon_exclusive/PG_mappedtodisk > -> s/int rmap_flags/rmap_t rmap_flags/ in remove_migration_pmd() > * "mm/gup: sanity-check with CONFIG_DEBUG_VM that anonymous pages are > exclusive when (un)pinning" > -> Use IS_ENABLED(CONFIG_DEBUG_VM) instead of ifdef > > v1 -> v2: > * Tested on aarch64, ppc64, s390x and x86_64 > * "mm/page-flags: reuse PG_mappedtodisk as PG_anon_exclusive for PageAnon() > pages" > -> Use PG_mappedtodisk instead of PG_slab (thanks Willy!), this simlifies > the patch and necessary handling a lot. Add safety BUG_ON's > -> Move most documentation to the patch description, to be placed in a > proper documentation doc in the future, once everything's in place > * ""mm: remember exclusively mapped anonymous pages with PG_anon_exclusive > -> Skip check+clearing in page_try_dup_anon_rmap(), otherwise we might > trigger a wrong VM_BUG_ON() for KSM pages in ClearPageAnonExclusive() > -> In __split_huge_pmd_locked(), call page_try_share_anon_rmap() only > for "anon_exclusive", otherwise we might trigger a wrong VM_BUG_ON() > -> In __split_huge_page_tail(), drop any remaining PG_anon_exclusive on > tail pages, and document why that is fine > > RFC -> v1: > * Rephrased/extended some patch descriptions+comments > * Tested on aarch64, ppc64 and x86_64 > * "mm/rmap: convert RMAP flags to a proper distinct rmap_t type" > -> Added > * "mm/rmap: drop "compound" parameter from page_add_new_anon_rmap()" > -> Added > * "mm: remember exclusively mapped anonymous pages with PG_anon_exclusive" > -> Fixed __do_huge_pmd_anonymous_page() to recheck after temporarily > dropping the PT lock. > -> Use "reuse" label in __do_huge_pmd_anonymous_page() > -> Slightly simplify logic in hugetlb_cow() > -> In remove_migration_pte(), remove unrelated changes around > page_remove_rmap() > * "mm: support GUP-triggered unsharing of anonymous pages" > -> In handle_pte_fault(), trigger pte_mkdirty() only with > FAULT_FLAG_WRITE > -> In __handle_mm_fault(), extend comment regarding anonymous PUDs > * "mm/gup: trigger FAULT_FLAG_UNSHARE when R/O-pinning a possibly shared > anonymous page" > -> Added unsharing logic to gup_hugepte() and gup_huge_pud() > -> Changed return logic in __follow_hugetlb_must_fault(), making sure > that "unshare" is always set > * "mm/gup: sanity-check with CONFIG_DEBUG_VM that anonymous pages are > exclusive when (un)pinning" > -> Slightly simplified sanity_check_pinned_pages() > > David Hildenbrand (16): > mm/rmap: fix missing swap_free() in try_to_unmap() after > arch_unmap_one() failed > mm/hugetlb: take src_mm->write_protect_seq in > copy_hugetlb_page_range() > mm/memory: slightly simplify copy_present_pte() > mm/rmap: split page_dup_rmap() into page_dup_file_rmap() and > page_try_dup_anon_rmap() > mm/rmap: convert RMAP flags to a proper distinct rmap_t type > mm/rmap: remove do_page_add_anon_rmap() > mm/rmap: pass rmap flags to hugepage_add_anon_rmap() > mm/rmap: drop "compound" parameter from page_add_new_anon_rmap() > mm/rmap: use page_move_anon_rmap() when reusing a mapped PageAnon() > page exclusively > mm/huge_memory: remove outdated VM_WARN_ON_ONCE_PAGE from unmap_page() > mm/page-flags: reuse PG_mappedtodisk as PG_anon_exclusive for > PageAnon() pages > mm: remember exclusively mapped anonymous pages with PG_anon_exclusive > mm/gup: disallow follow_page(FOLL_PIN) > mm: support GUP-triggered unsharing of anonymous pages > mm/gup: trigger FAULT_FLAG_UNSHARE when R/O-pinning a possibly shared > anonymous page > mm/gup: sanity-check with CONFIG_DEBUG_VM that anonymous pages are > exclusive when (un)pinning > > include/linux/mm.h | 46 +++++++- > include/linux/mm_types.h | 8 ++ > include/linux/page-flags.h | 39 ++++++- > include/linux/rmap.h | 118 +++++++++++++++++-- > include/linux/swap.h | 15 ++- > include/linux/swapops.h | 25 ++++ > kernel/events/uprobes.c | 2 +- > mm/gup.c | 106 ++++++++++++++++- > mm/huge_memory.c | 127 +++++++++++++++----- > mm/hugetlb.c | 135 ++++++++++++++------- > mm/khugepaged.c | 2 +- > mm/ksm.c | 15 ++- > mm/memory.c | 234 +++++++++++++++++++++++-------------- > mm/memremap.c | 9 ++ > mm/migrate.c | 18 ++- > mm/migrate_device.c | 23 +++- > mm/mprotect.c | 8 +- > mm/rmap.c | 97 +++++++++++---- > mm/swapfile.c | 8 +- > mm/userfaultfd.c | 2 +- > tools/vm/page-types.c | 8 +- > 21 files changed, 825 insertions(+), 220 deletions(-) > > > base-commit: 1930a6e739c4b4a654a69164dbe39e554d228915 The effective change on v5.17 *before rebasing* compared to v2 is: diff --git a/mm/gup.c b/mm/gup.c index 72e39b77da10..e1de0104cb19 100644 --- a/mm/gup.c +++ b/mm/gup.c @@ -48,7 +48,9 @@ static void hpage_pincount_sub(struct page *page, int refs) static inline void sanity_check_pinned_pages(struct page **pages, unsigned long npages) { -#ifdef CONFIG_DEBUG_VM + if (!IS_ENABLED(CONFIG_DEBUG_VM)) + return; + /* * We only pin anonymous pages if they are exclusive. Once pinned, we * can no longer turn them possibly shared and PageAnonExclusive() will @@ -74,7 +76,6 @@ static inline void sanity_check_pinned_pages(struct page **pages, VM_BUG_ON_PAGE(!PageAnonExclusive(head) && !PageAnonExclusive(page), page); } -#endif /* CONFIG_DEBUG_VM */ } /* Equivalent to calling put_page() @refs times. */ diff --git a/mm/huge_memory.c b/mm/huge_memory.c index 0cc34addd911..c53764b0640f 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -2316,8 +2316,6 @@ static void unmap_page(struct page *page) try_to_migrate(page, ttu_flags); else try_to_unmap(page, ttu_flags | TTU_IGNORE_MLOCK); - - VM_WARN_ON_ONCE_PAGE(page_mapped(page), page); } static void remap_page(struct page *page, unsigned int nr) @@ -3187,7 +3185,7 @@ void remove_migration_pmd(struct page_vma_mapped_walk *pvmw, struct page *new) flush_cache_range(vma, mmun_start, mmun_start + HPAGE_PMD_SIZE); if (PageAnon(new)) { - int rmap_flags = RMAP_COMPOUND; + rmap_t rmap_flags = RMAP_COMPOUND; if (!is_readable_migration_entry(entry)) rmap_flags |= RMAP_EXCLUSIVE; diff --git a/mm/memory.c b/mm/memory.c index b0b9d07a2850..1c8b52771799 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -727,7 +727,7 @@ static void restore_exclusive_pte(struct vm_area_struct *vma, * created when the swap entry was made. */ if (PageAnon(page)) - page_add_anon_rmap(page, vma, address, false); + page_add_anon_rmap(page, vma, address, RMAP_NONE); else /* * Currently device exclusive access only supports anonymous diff --git a/mm/migrate.c b/mm/migrate.c index 7f440d2103ce..013eb4f52fed 100644 --- a/mm/migrate.c +++ b/mm/migrate.c @@ -384,10 +384,6 @@ int folio_migrate_mapping(struct address_space *mapping, /* No turning back from here */ newfolio->index = folio->index; newfolio->mapping = folio->mapping; - /* - * Note: PG_anon_exclusive is always migrated via migration - * entries. - */ if (folio_test_swapbacked(folio)) __folio_set_swapbacked(newfolio); @@ -540,6 +536,12 @@ void folio_migrate_flags(struct folio *newfolio, struct folio *folio) folio_set_workingset(newfolio); if (folio_test_checked(folio)) folio_set_checked(newfolio); + /* + * PG_anon_exclusive (-> PG_mappedtodisk) is always migrated via + * migration entries. We can still have PG_anon_exclusive set on an + * effectively unmapped and unreferenced first sub-pages of an + * anonymous THP: we can simply copy it here via PG_mappedtodisk. + */ if (folio_test_mappedtodisk(folio)) folio_set_mappedtodisk(newfolio); diff --git a/mm/rmap.c b/mm/rmap.c index 9d2a7e11e8cc..c18f6d7891d0 100644 --- a/mm/rmap.c +++ b/mm/rmap.c @@ -1939,7 +1939,7 @@ static bool try_to_migrate_one(struct page *page, struct vm_area_struct *vma, break; } VM_BUG_ON_PAGE(pte_write(pteval) && PageAnon(page) && - !anon_exclusive, page); + !anon_exclusive, subpage); if (anon_exclusive && page_try_share_anon_rmap(subpage)) { set_pte_at(mm, address, pvmw.pte, pteval); @@ -2476,7 +2476,7 @@ void hugepage_add_anon_rmap(struct page *page, struct vm_area_struct *vma, VM_BUG_ON_PAGE(!first && PageAnonExclusive(page), page); if (first) __page_set_anon_rmap(page, vma, address, - flags & RMAP_EXCLUSIVE); + !!(flags & RMAP_EXCLUSIVE)); } void hugepage_add_new_anon_rmap(struct page *page, -- Thanks, David / dhildenb