Re: [PATCH 1/3] mm: Move arch_do_swap_page() call to before swap_free()

Peter Collingbourne <pcc@xxxxxxxxxx> · Mon, 15 May 2023 16:40:01 -0700

On Mon, May 15, 2023 at 06:34:30PM +0100, Catalin Marinas wrote:
> On Sat, May 13, 2023 at 05:29:53AM +0200, David Hildenbrand wrote:
> > On 13.05.23 01:57, Peter Collingbourne wrote:
> > > diff --git a/mm/memory.c b/mm/memory.c
> > > index 01a23ad48a04..83268d287ff1 100644
> > > --- a/mm/memory.c
> > > +++ b/mm/memory.c
> > > @@ -3914,19 +3914,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> > >   		}
> > >   	}
> > > -	/*
> > > -	 * Remove the swap entry and conditionally try to free up the swapcache.
> > > -	 * We're already holding a reference on the page but haven't mapped it
> > > -	 * yet.
> > > -	 */
> > > -	swap_free(entry);
> > > -	if (should_try_to_free_swap(folio, vma, vmf->flags))
> > > -		folio_free_swap(folio);
> > > -
> > > -	inc_mm_counter(vma->vm_mm, MM_ANONPAGES);
> > > -	dec_mm_counter(vma->vm_mm, MM_SWAPENTS);
> > >   	pte = mk_pte(page, vma->vm_page_prot);
> > > -
> > >   	/*
> > >   	 * Same logic as in do_wp_page(); however, optimize for pages that are
> > >   	 * certainly not shared either because we just allocated them without
> > > @@ -3946,8 +3934,21 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> > >   		pte = pte_mksoft_dirty(pte);
> > >   	if (pte_swp_uffd_wp(vmf->orig_pte))
> > >   		pte = pte_mkuffd_wp(pte);
> > > +	arch_do_swap_page(vma->vm_mm, vma, vmf->address, pte, vmf->orig_pte);
> > >   	vmf->orig_pte = pte;
> > > +	/*
> > > +	 * Remove the swap entry and conditionally try to free up the swapcache.
> > > +	 * We're already holding a reference on the page but haven't mapped it
> > > +	 * yet.
> > > +	 */
> > > +	swap_free(entry);
> > > +	if (should_try_to_free_swap(folio, vma, vmf->flags))
> > > +		folio_free_swap(folio);
> > > +
> > > +	inc_mm_counter(vma->vm_mm, MM_ANONPAGES);
> > > +	dec_mm_counter(vma->vm_mm, MM_SWAPENTS);
> > > +
> > >   	/* ksm created a completely new copy */
> > >   	if (unlikely(folio != swapcache && swapcache)) {
> > >   		page_add_new_anon_rmap(page, vma, vmf->address);
> > > @@ -3959,7 +3960,6 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> > >   	VM_BUG_ON(!folio_test_anon(folio) ||
> > >   			(pte_write(pte) && !PageAnonExclusive(page)));
> > >   	set_pte_at(vma->vm_mm, vmf->address, vmf->pte, pte);
> > > -	arch_do_swap_page(vma->vm_mm, vma, vmf->address, pte, vmf->orig_pte);
> > >   	folio_unlock(folio);
> > >   	if (folio != swapcache && swapcache) {
> > 
> > 
> > You are moving the folio_free_swap() call after the folio_ref_count(folio)
> > == 1 check, which means that such (previously) swapped pages that are
> > exclusive cannot be detected as exclusive.
> > 
> > There must be a better way to handle MTE here.
> > 
> > Where are the tags stored, how is the location identified, and when are they
> > effectively restored right now?
> 
> I haven't gone through Peter's patches yet but a pretty good description
> of the problem is here:
> https://lore.kernel.org/all/5050805753ac469e8d727c797c2218a9d780d434.camel@xxxxxxxxxxxx/.
> I couldn't reproduce it with my swap setup but both Qun-wei and Peter
> triggered it.

In order to reproduce this bug it is necessary for the swap slot cache
to be disabled, which is unlikely to occur during normal operation. I
was only able to reproduce the bug by disabling it forcefully with the
following patch:

diff --git a/mm/swap_slots.c b/mm/swap_slots.c
index 0bec1f705f8e0..25afba16980c7 100644
--- a/mm/swap_slots.c
+++ b/mm/swap_slots.c
@@ -79,7 +79,7 @@ void disable_swap_slots_cache_lock(void)
 
 static void __reenable_swap_slots_cache(void)
 {
-	swap_slot_cache_enabled = has_usable_swap();
+	swap_slot_cache_enabled = false;
 }
 
 void reenable_swap_slots_cache_unlock(void)

With that I can trigger the bug on an MTE-utilizing process by running
a program that enumerates the process's private anonymous mappings and
calls process_madvise(MADV_PAGEOUT) on all of them.

> When a tagged page is swapped out, the arm64 code stores the metadata
> (tags) in a local xarray indexed by the swap pte. When restoring from
> swap, the arm64 set_pte_at() checks this xarray using the old swap pte
> and spills the tags onto the new page. Apparently something changed in
> the kernel recently that causes swap_range_free() to be called before
> set_pte_at(). The arm64 arch_swap_invalidate_page() frees the metadata
> from the xarray and the subsequent set_pte_at() won't find it.
> 
> If we have the page, the metadata can be restored before set_pte_at()
> and I guess that's what Peter is trying to do (again, I haven't looked
> at the details yet; leaving it for tomorrow).
> 
> Is there any other way of handling this? E.g. not release the metadata
> in arch_swap_invalidate_page() but later in set_pte_at() once it was
> restored. But then we may leak this metadata if there's no set_pte_at()
> (the process mapping the swap entry died).

Another problem that I can see with this approach is that it does not
respect reference counts for swap entries, and it's unclear whether that
can be done in a non-racy fashion.

Another approach that I considered was to move the hook to swap_readpage()
as in the patch below (sorry, it only applies to an older version
of Android's android14-6.1 branch and not mainline, but you get the
idea). But during a stress test (running the aforementioned program that
calls process_madvise(MADV_PAGEOUT) in a loop during an Android "monkey"
test) I discovered the following racy use-after-free that can occur when
two tasks T1 and T2 concurrently restore the same page:

T1:                  | T2:
arch_swap_readpage() |
                     | arch_swap_readpage() -> mte_restore_tags() -> xe_load()
swap_free()          |
                     | arch_swap_readpage() -> mte_restore_tags() -> mte_restore_page_tags()

We can avoid it by taking the swap_info_struct::lock spinlock in
mte_restore_tags(), but it seems like it would lead to lock contention.

Peter

diff --git a/arch/arm64/include/asm/mte.h b/arch/arm64/include/asm/mte.h
index 3f8199ba265a1..99c8be073f107 100644
--- a/arch/arm64/include/asm/mte.h
+++ b/arch/arm64/include/asm/mte.h
@@ -25,7 +25,7 @@ unsigned long mte_copy_tags_to_user(void __user *to, void *from,
 				    unsigned long n);
 int mte_save_tags(struct page *page);
 void mte_save_page_tags(const void *page_addr, void *tag_storage);
-bool mte_restore_tags(swp_entry_t entry, struct page *page);
+void mte_restore_tags(struct page *page);
 void mte_restore_page_tags(void *page_addr, const void *tag_storage);
 void mte_invalidate_tags(int type, pgoff_t offset);
 void mte_invalidate_tags_area(int type);
diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
index 812373cff4eec..32d3c661a0eee 100644
--- a/arch/arm64/include/asm/pgtable.h
+++ b/arch/arm64/include/asm/pgtable.h
@@ -1054,11 +1054,11 @@ static inline void arch_swap_invalidate_area(int type)
 		mte_invalidate_tags_area(type);
 }
 
-#define __HAVE_ARCH_SWAP_RESTORE
-static inline void arch_swap_restore(swp_entry_t entry, struct folio *folio)
+#define __HAVE_ARCH_SWAP_READPAGE
+static inline void arch_swap_readpage(struct page *page)
 {
-	if (system_supports_mte() && mte_restore_tags(entry, &folio->page))
-		set_page_mte_tagged(&folio->page);
+	if (system_supports_mte())
+  		mte_restore_tags(page);
 }
 
 #endif /* CONFIG_ARM64_MTE */
diff --git a/arch/arm64/kernel/mte.c b/arch/arm64/kernel/mte.c
index 84a085d536f84..176f094ecaa1e 100644
--- a/arch/arm64/kernel/mte.c
+++ b/arch/arm64/kernel/mte.c
@@ -38,15 +38,6 @@ EXPORT_SYMBOL_GPL(mte_async_or_asymm_mode);
 static void mte_sync_page_tags(struct page *page, pte_t old_pte,
 			       bool check_swap, bool pte_is_tagged)
 {
-	if (check_swap && is_swap_pte(old_pte)) {
-		swp_entry_t entry = pte_to_swp_entry(old_pte);
-
-		if (!non_swap_entry(entry) && mte_restore_tags(entry, page)) {
-			set_page_mte_tagged(page);
-			return;
-		}
-	}
-
 	if (!pte_is_tagged)
 		return;
 
diff --git a/arch/arm64/mm/mteswap.c b/arch/arm64/mm/mteswap.c
index 70f913205db99..3fe7774f32b3c 100644
--- a/arch/arm64/mm/mteswap.c
+++ b/arch/arm64/mm/mteswap.c
@@ -46,21 +46,23 @@ int mte_save_tags(struct page *page)
 	return 0;
 }
 
-bool mte_restore_tags(swp_entry_t entry, struct page *page)
+void mte_restore_tags(struct page *page)
 {
+	swp_entry_t entry = folio_swap_entry(page_folio(page));
 	void *tags = xa_load(&mte_pages, entry.val);
 
 	if (!tags)
-		return false;
+		return;
 
 	/*
 	 * Test PG_mte_tagged again in case it was racing with another
 	 * set_pte_at().
 	 */
-	if (!test_and_set_bit(PG_mte_tagged, &page->flags))
+	if (!test_and_set_bit(PG_mte_tagged, &page->flags)) {
 		mte_restore_page_tags(page_address(page), tags);
-
-	return true;
+		if (kasan_hw_tags_enabled())
+			page_kasan_tag_reset(page);
+	}
 }
 
 void mte_invalidate_tags(int type, pgoff_t offset)
diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index 5f0d7d0b9471b..eea1e545595ca 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -793,8 +793,8 @@ static inline void arch_swap_invalidate_area(int type)
 }
 #endif
 
-#ifndef __HAVE_ARCH_SWAP_RESTORE
-static inline void arch_swap_restore(swp_entry_t entry, struct folio *folio)
+#ifndef __HAVE_ARCH_SWAP_READPAGE
+static inline void arch_swap_readpage(struct page *page)
 {
 }
 #endif
diff --git a/mm/page_io.c b/mm/page_io.c
index 3a5f921b932e8..a2f53dbeca7b3 100644
--- a/mm/page_io.c
+++ b/mm/page_io.c
@@ -470,6 +470,12 @@ int swap_readpage(struct page *page, bool synchronous,
 	}
 	delayacct_swapin_start();
 
+	/*
+	 * Some architectures may have to restore extra metadata to the
+	 * page when reading from swap.
+	 */
+	arch_swap_readpage(page);
+
 	if (frontswap_load(page) == 0) {
 		SetPageUptodate(page);
 		unlock_page(page);
diff --git a/mm/shmem.c b/mm/shmem.c
index 0b335607bf2ad..82ccf1e6efe5d 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -1784,12 +1784,6 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
 	}
 	folio_wait_writeback(folio);
 
-	/*
-	 * Some architectures may have to restore extra metadata to the
-	 * folio after reading from swap.
-	 */
-	arch_swap_restore(swap, folio);
-
 	if (shmem_should_replace_folio(folio, gfp)) {
 		error = shmem_replace_folio(&folio, gfp, info, index);
 		if (error)