Re: [PATCH v3 1/2] mm/migrate: Fix read-only page got writable when recover pte

David Hildenbrand <david@xxxxxxxxxx> · Thu, 1 Dec 2022 16:42:52 +0100

On 01.12.22 16:28, Peter Xu wrote:
Hi, Andrew,

On Wed, Nov 30, 2022 at 02:24:25PM -0800, Andrew Morton wrote:
On Tue, 15 Nov 2022 19:17:43 +0100 David Hildenbrand <david@xxxxxxxxxx> wrote:

On 14.11.22 01:04, Peter Xu wrote:
Ives van Hoorne from codesandbox.io reported an issue regarding possible
data loss of uffd-wp when applied to memfds on heavily loaded systems.  The
symptom is some read page got data mismatch from the snapshot child VMs.

Here I can also reproduce with a Rust reproducer that was provided by Ives
that keeps taking snapshot of a 256MB VM, on a 32G system when I initiate
80 instances I can trigger the issues in ten minutes.

It turns out that we got some pages write-through even if uffd-wp is
applied to the pte.

The problem is, when removing migration entries, we didn't really worry
about write bit as long as we know it's not a write migration entry.  That
may not be true, for some memory types (e.g. writable shmem) mk_pte can
return a pte with write bit set, then to recover the migration entry to its
original state we need to explicit wr-protect the pte or it'll has the
write bit set if it's a read migration entry.  For uffd it can cause
write-through.

The relevant code on uffd was introduced in the anon support, which is
commit f45ec5ff16a7 ("userfaultfd: wp: support swap and page migration",
2020-04-07).  However anon shouldn't suffer from this problem because anon
should already have the write bit cleared always, so that may not be a
proper Fixes target, while I'm adding the Fixes to be uffd shmem support.


...

--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -213,8 +213,14 @@ static bool remove_migration_pte(struct folio *folio,
   			pte = pte_mkdirty(pte);
   		if (is_writable_migration_entry(entry))
   			pte = maybe_mkwrite(pte, vma);
-		else if (pte_swp_uffd_wp(*pvmw.pte))
+		else
+			/* NOTE: mk_pte can have write bit set */
+			pte = pte_wrprotect(pte);
+
+		if (pte_swp_uffd_wp(*pvmw.pte)) {
+			WARN_ON_ONCE(pte_write(pte));

Will this warnnig trigger in the scenario you and Ives have discovered?

If without the above newly added wr-protect, yes.  This is the case where
we found we got write bit set even if uffd-wp bit is also set, hence allows
the write to go through even if marked protected.


   			pte = pte_mkuffd_wp(pte);
+		}
   
   		if (folio_test_anon(folio) && !is_readable_migration_entry(entry))
   			rmap_flags |= RMAP_EXCLUSIVE;

As raised, I don't agree to this generic non-uffd-wp change without
further, clear justification.

Pater, can you please work this further?

I didn't reply here because I have already replied with the question in
previous version with a few attempts.  Quotting myself:

https://lore.kernel.org/all/Y3KgYeMTdTM0FN5W@x1n/

         The thing is recovering the pte into its original form is the
         safest approach to me, so I think we need justification on why it's
         always safe to set the write bit.

I've also got another longer email trying to explain why I think it's the
other way round to be justfied, rather than justifying removal of the write
bit for a read migration entry, here:


And I disagree for this patch that is supposed to fix this hunk:


@@ -243,11 +243,15 @@ static bool remove_migration_pte(struct page *page, struct vm_area_struct *vma,
                entry = pte_to_swp_entry(*pvmw.pte);
                if (is_write_migration_entry(entry))
                        pte = maybe_mkwrite(pte, vma);
+               else if (pte_swp_uffd_wp(*pvmw.pte))
+                       pte = pte_mkuffd_wp(pte);
 
                if (unlikely(is_zone_device_page(new))) {
                        if (is_device_private_page(new)) {
                                entry = make_device_private_entry(new, pte_write(pte));
                                pte = swp_entry_to_pte(entry);
+                               if (pte_swp_uffd_wp(*pvmw.pte))
+                                       pte = pte_mkuffd_wp(pte);
                        }
                }
 

There is really nothing to justify the other way around here.
If it's broken fix it independently and properly backport it independenty.

But we don't know about any such broken case.

I have no energy to spare to argue further ;)



--
Thanks,

David / dhildenb