On 08.03.22 15:14, David Hildenbrand wrote:
> This series is the result of the discussion on the previous approach [2].
> More information on the general COW issues can be found there. It is based
> on v5.17-rc7 and [1], which resides in -mm and -next:
>   [PATCH v3 0/9] mm: COW fixes part 1: fix the COW security issue for
>   THP and swap
> 
> v1 is located at:
>   https://github.com/davidhildenbrand/linux/tree/cow_fixes_part_2_v1
> 
> This series fixes memory corruptions when a GUP pin (FOLL_PIN) was taken
> on an anonymous page and COW logic fails to detect exclusivity of the
> page, to then replace the anonymous page by a copy in the page table: the
> GUP pin loses synchronicity with the pages mapped into the page tables.
> 
> This issue, including other related COW issues, has been summarized in [3]
> under 3):
> "
> 3. Intra Process Memory Corruptions due to Wrong COW (FOLL_PIN)
> 
> page_maybe_dma_pinned() is used to check if a page may be pinned for
> DMA (using FOLL_PIN instead of FOLL_GET). While false positives are
> tolerable, false negatives are problematic: pages that are pinned for
> DMA must not be added to the swapcache. If it happens, the (now pinned)
> page could be faulted back from the swapcache into page tables
> read-only. Future write-access would detect the pinning and COW the
> page, losing synchronicity. For the interested reader, this is nicely
> documented in feb889fb40fa ("mm: don't put pinned pages into the swap
> cache").
> 
> Peter reports [8] that page_maybe_dma_pinned() as used is racy in some
> cases and can result in a violation of the documented semantics: giving
> false negatives because of the race.
> 
> There are cases where we call it without properly taking a per-process
> sequence lock, turning the usage of page_maybe_dma_pinned() racy. While
> one case (clear_refs SOFTDIRTY tracking, see below) seems to be easy to
> handle, there is especially one rmap case (shrink_page_list) that's hard
> to fix: in the rmap world, we're not limited to a single process.
> 
> The shrink_page_list() issue is really subtle. If we race with
> someone pinning a page, we can trigger the same issue as in the FOLL_GET
> case. See the detail section at the end of this mail for a discussion of
> how badly this can bite us with VFIO or other FOLL_PIN users.
> 
> It's harder to reproduce, but I managed to modify the O_DIRECT
> reproducer to use io_uring fixed buffers [15] instead, which ends up
> using FOLL_PIN | FOLL_WRITE | FOLL_LONGTERM to pin buffer pages and can
> similarly trigger a loss of synchronicity and consequently a memory
> corruption.
> 
> Again, the root issue is that a write-fault on a page that has
> additional references results in a COW and thereby a loss of
> synchronicity and consequently a memory corruption if two parties
> believe they are referencing the same page.
> "
> 
> This series makes GUP pins (R/O and R/W) on anonymous pages fully
> reliable, also taking care of concurrent pinning via GUP-fast; for
> example, it fully fixes a recently reported issue regarding NUMA
> balancing [4]. While doing that, it further reduces "unnecessary COWs",
> especially when we don't fork()/KSM and don't swapout, and fixes the COW
> security issue for hugetlb with FOLL_PIN.
> 
> In summary, we track via a pageflag (PG_anon_exclusive) whether a mapped
> anonymous page is exclusive. Exclusive anonymous pages that are mapped
> R/O can directly be mapped R/W by the COW logic in the write fault
> handler. Before an exclusive anonymous page can be shared (fork(), KSM),
> it first has to be marked shared -- which will fail if there are GUP pins
> on the page. GUP is only allowed to take a pin on anonymous pages that
> are exclusive. The PT lock is the primary mechanism to synchronize
> modifications of PG_anon_exclusive. GUP-fast is synchronized either via
> the src_mm->write_protect_seq or via clear/invalidate+flush of the
> relevant page table entry.
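To make the sharing rule above concrete, here's a minimal sketch of the
fork()/KSM side -- invented helper name, and all ordering against GUP-fast
omitted; the series' real helper is page_try_dup_anon_rmap(), introduced in
patch #4 and wired up in patch #11:

#include <linux/mm.h>
#include <linux/page-flags.h>

/*
 * Illustrative sketch only: before an anonymous page may get mapped into
 * a second process, it has to lose its PG_anon_exclusive marker -- and
 * that has to fail if the page may be pinned, because GUP pins require
 * exclusivity.
 */
static inline int try_share_anon_page(struct page *page)
{
	if (PageAnonExclusive(page)) {
		/* A pinned page must stay exclusive: the caller has to copy. */
		if (page_maybe_dma_pinned(page))
			return -EBUSY;
		ClearPageAnonExclusive(page);
	}
	return 0;
}
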
> 
> Special care has to be taken about swap, migration, and THPs (whereby a
> PMD-mapping can be converted to a PTE mapping and we have to track
> information for subpages). Besides these, we let the rmap code handle most
> magic. For reliable R/O pins of anonymous pages, we need the
> FAULT_FLAG_UNSHARE logic that was part of our previous approach [2];
> however, it's now 100% mapcount free and I further simplified it a bit.
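The R/O-pin side is the mirror image; roughly the following shape (again a
simplified sketch with an invented name -- see the gup_must_unshare()
callers in the diff below for the real checks from patch #14):

#include <linux/mm.h>
#include <linux/page-flags.h>

/*
 * Illustrative sketch only: a R/O pin (FOLL_PIN) may only be taken on an
 * anonymous page that is marked exclusive; otherwise, GUP has to trigger
 * unsharing first.
 */
static inline bool must_unshare_for_ro_pin(unsigned int flags,
					   struct page *page)
{
	if (!(flags & FOLL_PIN))
		return false;	/* plain FOLL_GET is not affected (yet) */
	if (!PageAnon(page))
		return false;	/* only anonymous pages are tracked */
	return !PageAnonExclusive(page);
}

A caller that sees "true" backs off and retries the fault with
FAULT_FLAG_UNSHARE set, which replaces the possibly shared page by an
exclusive anonymous copy without mapping it writable.
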
> -> Use "reuse" label in __do_huge_pmd_anonymous_page() > -> Slightly simplify logic in hugetlb_cow() > -> In remove_migration_pte(), remove unrelated changes around > page_remove_rmap() > * "mm: support GUP-triggered unsharing of anonymous pages" > -> In handle_pte_fault(), trigger pte_mkdirty() only with > FAULT_FLAG_WRITE > -> In __handle_mm_fault(), extend comment regarding anonymous PUDs > * "mm/gup: trigger FAULT_FLAG_UNSHARE when R/O-pinning a possibly shared > anonymous page" > -> Added unsharing logic to gup_hugepte() and gup_huge_pud() > -> Changed return logic in __follow_hugetlb_must_fault(), making sure > that "unshare" is always set > * "mm/gup: sanity-check with CONFIG_DEBUG_VM that anonymous pages are > exclusive when (un)pinning" > -> Slightly simplified sanity_check_pinned_pages() > > David Hildenbrand (15): > mm/rmap: fix missing swap_free() in try_to_unmap() after > arch_unmap_one() failed > mm/hugetlb: take src_mm->write_protect_seq in > copy_hugetlb_page_range() > mm/memory: slightly simplify copy_present_pte() > mm/rmap: split page_dup_rmap() into page_dup_file_rmap() and > page_try_dup_anon_rmap() > mm/rmap: convert RMAP flags to a proper distinct rmap_t type > mm/rmap: remove do_page_add_anon_rmap() > mm/rmap: pass rmap flags to hugepage_add_anon_rmap() > mm/rmap: drop "compound" parameter from page_add_new_anon_rmap() > mm/rmap: use page_move_anon_rmap() when reusing a mapped PageAnon() > page exclusively > mm/page-flags: reuse PG_slab as PG_anon_exclusive for PageAnon() pages > mm: remember exclusively mapped anonymous pages with PG_anon_exclusive > mm/gup: disallow follow_page(FOLL_PIN) > mm: support GUP-triggered unsharing of anonymous pages > mm/gup: trigger FAULT_FLAG_UNSHARE when R/O-pinning a possibly shared > anonymous page > mm/gup: sanity-check with CONFIG_DEBUG_VM that anonymous pages are > exclusive when (un)pinning > > fs/proc/page.c | 3 +- > include/linux/mm.h | 46 ++++++- > include/linux/mm_types.h | 8 ++ > include/linux/page-flags.h | 124 +++++++++++++++++- > include/linux/rmap.h | 109 ++++++++++++++-- > include/linux/swap.h | 15 ++- > include/linux/swapops.h | 25 ++++ > include/trace/events/mmflags.h | 2 +- > kernel/events/uprobes.c | 2 +- > mm/gup.c | 103 ++++++++++++++- > mm/huge_memory.c | 122 +++++++++++++----- > mm/hugetlb.c | 137 ++++++++++++++------ > mm/khugepaged.c | 2 +- > mm/ksm.c | 15 ++- > mm/memory-failure.c | 24 +++- > mm/memory.c | 221 ++++++++++++++++++++------------- > mm/memremap.c | 11 ++ > mm/migrate.c | 40 +++++- > mm/mprotect.c | 8 +- > mm/page_alloc.c | 13 ++ > mm/rmap.c | 95 ++++++++++---- > mm/swapfile.c | 4 +- > mm/userfaultfd.c | 2 +- > 23 files changed, 904 insertions(+), 227 deletions(-) > Full diff to RFC: diff --git a/include/linux/rmap.h b/include/linux/rmap.h index c4957bac4868..1bc522d28a78 100644 --- a/include/linux/rmap.h +++ b/include/linux/rmap.h @@ -158,23 +158,37 @@ static inline void anon_vma_merge(struct vm_area_struct *vma, struct anon_vma *page_get_anon_vma(struct page *page); -/* bitflags for do_page_add_anon_rmap() */ -#define RMAP_EXCLUSIVE 0x01 -#define RMAP_COMPOUND 0x02 +/* RMAP flags, currently only relevant for some anon rmap operations. */ +typedef int __bitwise rmap_t; + +/* + * No special request: if the page is a subpage of a compound page, it is + * mapped via a PTE. The mapped (sub)page is possibly shared between processes. + */ +#define RMAP_NONE ((__force rmap_t)0) + +/* The (sub)page is exclusive to a single process. 
+#define RMAP_EXCLUSIVE		((__force rmap_t)BIT(0))
+
+/*
+ * The compound page is not mapped via PTEs, but instead via a single PMD and
+ * should be accounted accordingly.
+ */
+#define RMAP_COMPOUND		((__force rmap_t)BIT(1))
 
 /*
  * rmap interfaces called when adding or removing pte of page
  */
 void page_move_anon_rmap(struct page *, struct vm_area_struct *);
 void page_add_anon_rmap(struct page *, struct vm_area_struct *,
-		unsigned long, int);
+		unsigned long, rmap_t);
 void page_add_new_anon_rmap(struct page *, struct vm_area_struct *,
-		unsigned long, bool);
+		unsigned long);
 void page_add_file_rmap(struct page *, bool);
 void page_remove_rmap(struct page *, bool);
 
 void hugepage_add_anon_rmap(struct page *, struct vm_area_struct *,
-		unsigned long, int);
+		unsigned long, rmap_t);
 void hugepage_add_new_anon_rmap(struct page *, struct vm_area_struct *,
 		unsigned long);
 
diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
index 6357c3580d07..b6fdb23fb3ea 100644
--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -184,7 +184,7 @@ static int __replace_page(struct vm_area_struct *vma, unsigned long addr,
 
 	if (new_page) {
 		get_page(new_page);
-		page_add_new_anon_rmap(new_page, vma, addr, false);
+		page_add_new_anon_rmap(new_page, vma, addr);
 		lru_cache_add_inactive_or_unevictable(new_page, vma);
 	} else
 		/* no new page, just dec_mm_counter for old_page */
diff --git a/mm/gup.c b/mm/gup.c
index df1ae29a1f5f..b36f02f2b720 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -50,30 +50,27 @@ static inline void sanity_check_pinned_pages(struct page **pages,
 {
 #ifdef CONFIG_DEBUG_VM
 	/*
-	 * We expect to only succeed in pinning anonymous pages if they are
-	 * exclusive. Once pinned, we can no longer mark them shared and the
-	 * PageAnonExclusive() will stick around until the page is freed
-	 * after it was unmapped.
+	 * We only pin anonymous pages if they are exclusive. Once pinned, we
+	 * can no longer turn them possibly shared and PageAnonExclusive() will
+	 * stick around until the page is freed.
 	 *
-	 * We'd like to verify that our pinned anonymous pages are mapped
+	 * We'd like to verify that our pinned anonymous pages are still mapped
 	 * exclusively. The issue with anon THP is that we don't know how
 	 * they are/were mapped when pinning them. However, for anon
 	 * THP we can assume that either the given page (PTE-mapped THP) or
 	 * the head page (PMD-mapped THP) should be PageAnonExclusive(). If
-	 * neither is the case there is certainly something wrong.
+	 * neither is the case, there is certainly something wrong.
 	 */
 	for (; npages; npages--, pages++) {
 		struct page *page = *pages;
 		struct page *head = compound_head(page);
 
-		if (!PageAnon(page))
+		if (!PageAnon(head))
 			continue;
-		if (!PageCompound(page))
-			VM_BUG_ON_PAGE(!PageAnonExclusive(page), page);
-		else if (PageHuge(page))
+		if (!PageCompound(head) || PageHuge(head))
 			VM_BUG_ON_PAGE(!PageAnonExclusive(head), page);
 		else
-			/* Either PTE-mapped or PMD-mapped, we don't know. */
+			/* Either a PTE-mapped or a PMD-mapped THP. */
 			VM_BUG_ON_PAGE(!PageAnonExclusive(head) &&
 				       !PageAnonExclusive(page), page);
 	}
@@ -449,7 +446,7 @@ static void unpin_user_pages_lockless(struct page **pages, unsigned long npages)
 	/*
 	 * Don't perform any sanity checks because we might have raced with
-	 * fork() and some anonymous pages might no actually be shared --
+	 * fork() and some anonymous pages might now actually be shared --
 	 * which is why we're unpinning after all.
	 */
 	for_each_compound_head(index, pages, npages, head, ntails)
@@ -2607,6 +2604,11 @@ static int gup_hugepte(pte_t *ptep, unsigned long sz, unsigned long addr,
 		return 0;
 	}
 
+	if (!pte_write(pte) && gup_must_unshare(flags, head)) {
+		put_compound_head(head, refs, flags);
+		return 0;
+	}
+
 	*nr += refs;
 	SetPageReferenced(head);
 	return 1;
@@ -2706,6 +2708,11 @@ static int gup_huge_pud(pud_t orig, pud_t *pudp, unsigned long addr,
 		return 0;
 	}
 
+	if (!pud_write(orig) && gup_must_unshare(flags, head)) {
+		put_compound_head(head, refs, flags);
+		return 0;
+	}
+
 	*nr += refs;
 	SetPageReferenced(head);
 	return 1;
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 72ed42d0ef3c..d9559341a5f1 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -647,7 +647,7 @@ static vm_fault_t __do_huge_pmd_anonymous_page(struct vm_fault *vmf,
 
 		entry = mk_huge_pmd(page, vma->vm_page_prot);
 		entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma);
-		page_add_new_anon_rmap(page, vma, haddr, true);
+		page_add_new_anon_rmap(page, vma, haddr);
 		lru_cache_add_inactive_or_unevictable(page, vma);
 		pgtable_trans_huge_deposit(vma->vm_mm, vmf->pmd, pgtable);
 		set_pmd_at(vma->vm_mm, haddr, vmf->pmd, entry);
@@ -1296,22 +1296,9 @@ vm_fault_t do_huge_pmd_wp_page(struct vm_fault *vmf)
 	page = pmd_page(orig_pmd);
 	VM_BUG_ON_PAGE(!PageHead(page), page);
 
-	if (PageAnonExclusive(page)) {
-		pmd_t entry;
-
-		/* R/O and exclusive, nothing to do. */
-		if (unlikely(unshare)) {
-			spin_unlock(vmf->ptl);
-			return 0;
-		}
-
-		entry = pmd_mkyoung(orig_pmd);
-		entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma);
-		if (pmdp_set_access_flags(vma, haddr, vmf->pmd, entry, 1))
-			update_mmu_cache_pmd(vma, vmf->address, vmf->pmd);
-		spin_unlock(vmf->ptl);
-		return VM_FAULT_WRITE;
-	}
+	/* Early check when only holding the PT lock. */
+	if (PageAnonExclusive(page))
+		goto reuse;
 
 	if (!trylock_page(page)) {
 		get_page(page);
@@ -1327,6 +1314,12 @@ vm_fault_t do_huge_pmd_wp_page(struct vm_fault *vmf)
 		put_page(page);
 	}
 
+	/* Recheck after temporarily dropping the PT lock. */
+	if (PageAnonExclusive(page)) {
+		unlock_page(page);
+		goto reuse;
+	}
+
 	/*
 	 * See do_wp_page(): we can only reuse the page exclusively if there are
 	 * no additional references. Note that we always drain the LRU
@@ -1340,8 +1333,9 @@ vm_fault_t do_huge_pmd_wp_page(struct vm_fault *vmf)
 		pmd_t entry;
 
 		page_move_anon_rmap(page, vma);
+		unlock_page(page);
+reuse:
 		if (unlikely(unshare)) {
-			unlock_page(page);
 			spin_unlock(vmf->ptl);
 			return 0;
 		}
@@ -1349,7 +1343,6 @@ vm_fault_t do_huge_pmd_wp_page(struct vm_fault *vmf)
 		entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma);
 		if (pmdp_set_access_flags(vma, haddr, vmf->pmd, entry, 1))
 			update_mmu_cache_pmd(vma, vmf->address, vmf->pmd);
-		unlock_page(page);
 		spin_unlock(vmf->ptl);
 		return VM_FAULT_WRITE;
 	}
@@ -2376,8 +2369,8 @@ static void __split_huge_page_tail(struct page *head, int tail,
 	 * After successful get_page_unless_zero() might follow flags change,
 	 * for example lock_page() which set PG_waiters.
 	 *
-	 * Keep PG_anon_exclusive information already maintained for all
-	 * subpages of a compound page untouched.
+	 * Keep PG_anon_exclusive information, already maintained for all
+	 * subpages of a compound page, untouched.
	 */
 	page_tail->flags &= ~(PAGE_FLAGS_CHECK_AT_PREP & ~PG_anon_exclusive);
 	page_tail->flags |= (head->flags &
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 7bac6ed93fea..0d150d100111 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -5169,14 +5169,14 @@ static vm_fault_t hugetlb_wp(struct mm_struct *mm, struct vm_area_struct *vma,
 	 * owner and can reuse this page.
 	 */
 	if (page_mapcount(old_page) == 1 && PageAnon(old_page)) {
-		if (unlikely(unshare) && PageAnonExclusive(old_page))
-			return 0;
-		page_move_anon_rmap(old_page, vma);
+		if (!PageAnonExclusive(old_page))
+			page_move_anon_rmap(old_page, vma);
 		if (likely(!unshare))
 			set_huge_ptep_writable(vma, haddr, ptep);
 		return 0;
 	}
-	VM_BUG_ON_PAGE(PageAnon(old_page) && PageAnonExclusive(old_page), old_page);
+	VM_BUG_ON_PAGE(PageAnon(old_page) && PageAnonExclusive(old_page),
+		       old_page);
 
 	/*
 	 * If the process that created a MAP_PRIVATE mapping is about to
@@ -5968,9 +5968,11 @@ static inline bool __follow_hugetlb_must_fault(unsigned int flags, pte_t *pte,
 		return false;
 	if (flags & FOLL_WRITE)
 		return true;
-	if (gup_must_unshare(flags, pte_page(pteval)))
+	if (gup_must_unshare(flags, pte_page(pteval))) {
 		*unshare = true;
-	return *unshare;
+		return true;
+	}
+	return false;
 }
 
 long follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma,
@@ -5987,7 +5989,7 @@ long follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	while (vaddr < vma->vm_end && remainder) {
 		pte_t *pte;
 		spinlock_t *ptl = NULL;
-		bool unshare;
+		bool unshare = false;
 		int absent;
 		struct page *page;
 
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index a325a646be33..96cc903c4788 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -1183,7 +1183,7 @@ static void collapse_huge_page(struct mm_struct *mm,
 
 	spin_lock(pmd_ptl);
 	BUG_ON(!pmd_none(*pmd));
-	page_add_new_anon_rmap(new_page, vma, address, true);
+	page_add_new_anon_rmap(new_page, vma, address);
 	lru_cache_add_inactive_or_unevictable(new_page, vma);
 	pgtable_trans_huge_deposit(mm, pmd, pgtable);
 	set_pmd_at(mm, address, pmd, _pmd);
diff --git a/mm/ksm.c b/mm/ksm.c
index c380c36f3ebd..350b46432d65 100644
--- a/mm/ksm.c
+++ b/mm/ksm.c
@@ -1164,7 +1164,7 @@ static int replace_page(struct vm_area_struct *vma, struct page *page,
 	 */
 	if (!is_zero_pfn(page_to_pfn(kpage))) {
 		get_page(kpage);
-		page_add_anon_rmap(kpage, vma, addr, 0);
+		page_add_anon_rmap(kpage, vma, addr, RMAP_NONE);
 		newpte = mk_pte(kpage, vma->vm_page_prot);
 	} else {
 		newpte = pte_mkspecial(pfn_pte(page_to_pfn(kpage),
diff --git a/mm/memory.c b/mm/memory.c
index 33d046c34819..94355bb34d8c 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -899,7 +899,7 @@ copy_present_page(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma
 	*prealloc = NULL;
 	copy_user_highpage(new_page, page, addr, src_vma);
 	__SetPageUptodate(new_page);
-	page_add_new_anon_rmap(new_page, dst_vma, addr, false);
+	page_add_new_anon_rmap(new_page, dst_vma, addr);
 	lru_cache_add_inactive_or_unevictable(new_page, dst_vma);
 	rss[mm_counter(new_page)]++;
 
@@ -2948,6 +2948,7 @@ static inline void wp_page_reuse(struct vm_fault *vmf)
 	struct page *page = vmf->page;
 	pte_t entry;
 
+	VM_BUG_ON(!(vmf->flags & FAULT_FLAG_WRITE));
 	VM_BUG_ON(PageAnon(page) && !PageAnonExclusive(page));
 
 	/*
@@ -3068,7 +3069,7 @@ static vm_fault_t wp_page_copy(struct vm_fault *vmf)
 		 * some TLBs while the old PTE remains in others.
		 */
 		ptep_clear_flush_notify(vma, vmf->address, vmf->pte);
-		page_add_new_anon_rmap(new_page, vma, vmf->address, false);
+		page_add_new_anon_rmap(new_page, vma, vmf->address);
 		lru_cache_add_inactive_or_unevictable(new_page, vma);
 		/*
 		 * We call the notify macro here because, when using secondary
@@ -3282,8 +3283,8 @@ static vm_fault_t do_wp_page(struct vm_fault *vmf)
 
 	vmf->page = vm_normal_page(vma, vmf->address, vmf->orig_pte);
 	if (!vmf->page) {
-		/* No anonymous page -> nothing to do. */
 		if (unlikely(unshare)) {
+			/* No anonymous page -> nothing to do. */
 			pte_unmap_unlock(vmf->pte, vmf->ptl);
 			return 0;
 		}
@@ -3565,10 +3566,10 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
 	struct vm_area_struct *vma = vmf->vma;
 	struct page *page = NULL, *swapcache;
 	struct swap_info_struct *si = NULL;
+	rmap_t rmap_flags = RMAP_NONE;
 	swp_entry_t entry;
 	pte_t pte;
 	int locked;
-	int exclusive = 0;
 	vm_fault_t ret = 0;
 	void *shadow = NULL;
 
@@ -3744,7 +3745,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
 			vmf->flags &= ~FAULT_FLAG_WRITE;
 			ret |= VM_FAULT_WRITE;
 		}
-		exclusive = RMAP_EXCLUSIVE;
+		rmap_flags |= RMAP_EXCLUSIVE;
 	}
 	flush_icache_page(vma, page);
 	if (pte_swp_soft_dirty(vmf->orig_pte))
@@ -3757,10 +3758,10 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
 
 	/* ksm created a completely new copy */
 	if (unlikely(page != swapcache && swapcache)) {
-		page_add_new_anon_rmap(page, vma, vmf->address, false);
+		page_add_new_anon_rmap(page, vma, vmf->address);
 		lru_cache_add_inactive_or_unevictable(page, vma);
 	} else {
-		page_add_anon_rmap(page, vma, vmf->address, exclusive);
+		page_add_anon_rmap(page, vma, vmf->address, rmap_flags);
 	}
 	VM_BUG_ON(!PageAnon(page) || (pte_write(pte) && !PageAnonExclusive(page)));
 
@@ -3908,7 +3909,7 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
 	}
 
 	inc_mm_counter_fast(vma->vm_mm, MM_ANONPAGES);
-	page_add_new_anon_rmap(page, vma, vmf->address, false);
+	page_add_new_anon_rmap(page, vma, vmf->address);
 	lru_cache_add_inactive_or_unevictable(page, vma);
 setpte:
 	set_pte_at(vma->vm_mm, vmf->address, vmf->pte, entry);
@@ -4085,7 +4086,7 @@ void do_set_pte(struct vm_fault *vmf, struct page *page, unsigned long addr)
 	/* copy-on-write page */
 	if (write && !(vma->vm_flags & VM_SHARED)) {
 		inc_mm_counter_fast(vma->vm_mm, MM_ANONPAGES);
-		page_add_new_anon_rmap(page, vma, addr, false);
+		page_add_new_anon_rmap(page, vma, addr);
 		lru_cache_add_inactive_or_unevictable(page, vma);
 	} else {
 		inc_mm_counter_fast(vma->vm_mm, mm_counter_file(page));
@@ -4682,7 +4683,8 @@ static vm_fault_t handle_pte_fault(struct vm_fault *vmf)
 	if (vmf->flags & (FAULT_FLAG_WRITE|FAULT_FLAG_UNSHARE)) {
 		if (!pte_write(entry))
 			return do_wp_page(vmf);
-		entry = pte_mkdirty(entry);
+		else if (likely(vmf->flags & FAULT_FLAG_WRITE))
+			entry = pte_mkdirty(entry);
 	}
 	entry = pte_mkyoung(entry);
 	if (ptep_set_access_flags(vmf->vma, vmf->address, vmf->pte, entry,
@@ -4746,8 +4748,10 @@ static vm_fault_t __handle_mm_fault(struct vm_area_struct *vma,
 		barrier();
 		if (pud_trans_huge(orig_pud) || pud_devmap(orig_pud)) {
 
-			/* NUMA case for anonymous PUDs would go here */
-
+			/*
+			 * TODO once we support anonymous PUDs: NUMA case and
+			 * FAULT_FLAG_UNSHARE handling.
+			 */
 			if ((flags & FAULT_FLAG_WRITE) && !pud_write(orig_pud)) {
 				ret = wp_huge_pud(&vmf, orig_pud);
 				if (!(ret & VM_FAULT_FALLBACK))
diff --git a/mm/migrate.c b/mm/migrate.c
index 63a3241041e1..7f440d2103ce 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -188,7 +188,7 @@ static bool remove_migration_pte(struct page *page, struct vm_area_struct *vma,
 	VM_BUG_ON_PAGE(PageTail(page), page);
 	while (page_vma_mapped_walk(&pvmw)) {
-		int rmap_flags = 0;
+		rmap_t rmap_flags = RMAP_NONE;
 
 		if (PageKsm(page))
 			new = page;
@@ -2357,12 +2357,12 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
 			set_pte_at(mm, addr, ptep, swp_pte);
 
 			/*
-			 * This is like regular unmap: we remove the
-			 * rmap and drop page refcount. Page won't be
-			 * freed, as we took a reference just above.
+			 * This is like regular unmap: we remove the rmap and
+			 * drop page refcount. Page won't be freed, as we took
+			 * a reference just above.
 			 */
-			put_page(page);
 			page_remove_rmap(page, false);
+			put_page(page);
 
 			if (pte_present(pte))
 				unmapped++;
@@ -2753,7 +2753,7 @@ static void migrate_vma_insert_page(struct migrate_vma *migrate,
 		goto unlock_abort;
 
 	inc_mm_counter(mm, MM_ANONPAGES);
-	page_add_new_anon_rmap(page, vma, addr, false);
+	page_add_new_anon_rmap(page, vma, addr);
 	if (!is_zone_device_page(page))
 		lru_cache_add_inactive_or_unevictable(page, vma);
 	get_page(page);
diff --git a/mm/rmap.c b/mm/rmap.c
index cc54bd1cce61..9d2a7e11e8cc 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1141,8 +1141,8 @@ static void __page_check_anon_rmap(struct page *page,
 * and to ensure that PageAnon is not being upgraded racily to PageKsm
 * (but PageKsm is never downgraded to PageAnon).
 */
-void page_add_anon_rmap(struct page *page, struct vm_area_struct *vma,
-	unsigned long address, int flags)
+void page_add_anon_rmap(struct page *page,
+	struct vm_area_struct *vma, unsigned long address, rmap_t flags)
 {
 	bool compound = flags & RMAP_COMPOUND;
 	bool first;
@@ -1185,25 +1185,28 @@ void page_add_anon_rmap(struct page *page, struct vm_area_struct *vma,
 	/* address might be in next vma when migration races vma_adjust */
 	if (first)
 		__page_set_anon_rmap(page, vma, address,
-				     flags & RMAP_EXCLUSIVE);
+				     !!(flags & RMAP_EXCLUSIVE));
 	else
 		__page_check_anon_rmap(page, vma, address);
 }
 
 /**
- * page_add_new_anon_rmap - add pte mapping to a new anonymous page
+ * page_add_new_anon_rmap - add mapping to a new anonymous page
 * @page:	the page to add the mapping to
 * @vma:	the vm area in which the mapping is added
 * @address:	the user virtual address mapped
- * @compound:	charge the page as compound or small page
+ *
+ * If it's a compound page, it is accounted as a compound page. As the page
+ * is new, it's assumed to get mapped exclusively by a single process.
 *
 * Same as page_add_anon_rmap but must only be called on *new* pages.
 * This means the inc-and-test can be bypassed.
 * Page does not have to be locked.
 */
 void page_add_new_anon_rmap(struct page *page,
-	struct vm_area_struct *vma, unsigned long address, bool compound)
+	struct vm_area_struct *vma, unsigned long address)
 {
+	const bool compound = PageCompound(page);
	int nr = compound ? thp_nr_pages(page) : 1;
 
 	VM_BUG_ON_VMA(address < vma->vm_start || address >= vma->vm_end, vma);
@@ -1639,6 +1642,16 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
 				page_vma_mapped_walk_done(&pvmw);
 				break;
 			}
+			/*
+			 * Note: We *don't* remember yet if the page was mapped
+			 * exclusively in the swap entry, so swapin code has
+			 * to re-determine that manually and might detect the
+			 * page as possibly shared, for example, if there are
+			 * other references on the page or if the page is under
+			 * writeback. We made sure that there are no GUP pins
+			 * on the page that would rely on it, so for GUP pins
+			 * this is fine.
+			 */
 			if (list_empty(&mm->mmlist)) {
 				spin_lock(&mmlist_lock);
 				if (list_empty(&mm->mmlist))
@@ -2450,7 +2463,7 @@ void rmap_walk_locked(struct page *page, struct rmap_walk_control *rwc)
 * RMAP_COMPOUND is ignored.
 */
 void hugepage_add_anon_rmap(struct page *page, struct vm_area_struct *vma,
-			    unsigned long address, int flags)
+			    unsigned long address, rmap_t flags)
 {
 	struct anon_vma *anon_vma = vma->anon_vma;
 	int first;
@@ -2462,7 +2475,8 @@ void hugepage_add_anon_rmap(struct page *page, struct vm_area_struct *vma,
 	VM_BUG_ON_PAGE(!first && (flags & RMAP_EXCLUSIVE), page);
 	VM_BUG_ON_PAGE(!first && PageAnonExclusive(page), page);
 	if (first)
-		__page_set_anon_rmap(page, vma, address, flags & RMAP_EXCLUSIVE);
+		__page_set_anon_rmap(page, vma, address,
+				     flags & RMAP_EXCLUSIVE);
 }
 
 void hugepage_add_new_anon_rmap(struct page *page,
diff --git a/mm/swapfile.c b/mm/swapfile.c
index eacda7ca7ac9..7edc8e099b22 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -1800,9 +1800,9 @@ static int unuse_pte(struct vm_area_struct *vma, pmd_t *pmd,
 	inc_mm_counter(vma->vm_mm, MM_ANONPAGES);
 	get_page(page);
 	if (page == swapcache) {
-		page_add_anon_rmap(page, vma, addr, 0);
+		page_add_anon_rmap(page, vma, addr, RMAP_NONE);
 	} else { /* ksm created a completely new copy */
-		page_add_new_anon_rmap(page, vma, addr, false);
+		page_add_new_anon_rmap(page, vma, addr);
 		lru_cache_add_inactive_or_unevictable(page, vma);
 	}
 	set_pte_at(vma->vm_mm, addr, pte,
diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
index 0780c2a57ff1..4ca854ce14f0 100644
--- a/mm/userfaultfd.c
+++ b/mm/userfaultfd.c
@@ -98,7 +98,7 @@ int mfill_atomic_install_pte(struct mm_struct *dst_mm, pmd_t *dst_pmd,
 	if (page_in_cache)
 		page_add_file_rmap(page, false);
 	else
-		page_add_new_anon_rmap(page, dst_vma, dst_addr, false);
+		page_add_new_anon_rmap(page, dst_vma, dst_addr);
 
 	/*
 	 * Must happen after rmap, as mm_counter() checks mapping (via

-- 
Thanks,

David / dhildenb