[PATCH] mm: Add an explicit smp_wmb() to UFFDIO_CONTINUE

James Houghton <jthoughton@xxxxxxxxxx> · Wed, 6 Mar 2024 00:15:10 +0000

Users of UFFDIO_CONTINUE may reasonably assume that a write memory
barrier is included as part of UFFDIO_CONTINUE. That is, a user may
(mistakenly) believe that all writes it has done to a page that it is
now UFFDIO_CONTINUE'ing are guaranteed to be visible to anyone
subsequently reading the page through the newly mapped virtual memory
region.

Include only a single smp_wmb() for each UFFDIO_CONTINUE, as that is all
that is necessary. While we're at it, optimize the smp_wmb() that is
already incidentally present for the HugeTLB case.

Documentation doesn't specify if the kernel does a wmb(), so it's not
wrong not to include it. But by not including it, we are making is easy
for a user to have a very hard-to-detect bug. Include it now to be safe.

A user that decides to update the contents of the page in one thread and
UFFDIO_CONTINUE that page in another must already take additional steps
to synchronize properly.

Signed-off-by: James Houghton <jthoughton@xxxxxxxxxx>
---

I'm not sure if this patch should be merged. I think it's the right
thing to do, as it is very easy for a user to get this wrong. (I have
been using UFFDIO_CONTINUE for >2 years and only realized this problem
recently.) Given that it's not a "bug" strictly speaking, even if this
patch is a good idea, I'm unsure if it needs to be backported.

This quirk has existed since minor fault support was added for shmem[1].

I've tried to see if I can legitimately get a user to read stale data,
and a few attempts with this test[2] have been unsuccessful.

[1]: commit 153132571f02 ("userfaultfd/shmem: support UFFDIO_CONTINUE for shmem")
[2]: https://gist.github.com/48ca/38d0665b0f1a6319a56507dc73a173f9

 mm/hugetlb.c     | 15 +++++++++------
 mm/userfaultfd.c | 18 ++++++++++++++++++
 2 files changed, 27 insertions(+), 6 deletions(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index bb17e5c22759..533bf6b2d94d 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -6779,12 +6779,15 @@ int hugetlb_mfill_atomic_pte(pte_t *dst_pte,
 		}
 	}
 
-	/*
-	 * The memory barrier inside __folio_mark_uptodate makes sure that
-	 * preceding stores to the page contents become visible before
-	 * the set_pte_at() write.
-	 */
-	__folio_mark_uptodate(folio);
+	if (!is_continue) {
+		/*
+		 * The memory barrier inside __folio_mark_uptodate makes sure
+		 * that preceding stores to the page contents become visible
+		 * before the set_pte_at() write.
+		 */
+		__folio_mark_uptodate(folio);
+	} else
+		WARN_ON_ONCE(!folio_test_uptodate(folio));
 
 	/* Add shared, newly allocated pages to the page cache. */
 	if (vm_shared && !is_continue) {
diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
index 503ea77c81aa..d515b640ca48 100644
--- a/mm/userfaultfd.c
+++ b/mm/userfaultfd.c
@@ -531,6 +531,10 @@ static __always_inline ssize_t mfill_atomic_hugetlb(
 			goto out_unlock;
 	}
 
+	if (uffd_flags_mode_is(flags, MFILL_ATOMIC_CONTINUE))
+		/* See the comment in mfill_atomic. */
+		smp_wmb();
+
 	while (src_addr < src_start + len) {
 		BUG_ON(dst_addr >= dst_start + len);
 
@@ -743,6 +747,20 @@ static __always_inline ssize_t mfill_atomic(struct userfaultfd_ctx *ctx,
 	    uffd_flags_mode_is(flags, MFILL_ATOMIC_CONTINUE))
 		goto out_unlock;
 
+	if (uffd_flags_mode_is(flags, MFILL_ATOMIC_CONTINUE))
+		/*
+		 * A caller might reasonably assume that UFFDIO_CONTINUE
+		 * contains a wmb() to ensure that any writes to the
+		 * about-to-be-mapped page by the thread doing the
+		 * UFFDIO_CONTINUE are guaranteed to be visible to subsequent
+		 * reads of the page through the newly mapped address.
+		 *
+		 * For MFILL_ATOMIC_COPY, the wmb() is done for each COPYed
+		 * page. We can do the wmb() now for CONTINUE as the user has
+		 * already prepared the page contents.
+		 */
+		smp_wmb();
+
 	while (src_addr < src_start + len) {
 		pmd_t dst_pmdval;
 

base-commit: a7f399ae964e1d2a11d88d863a1d64392678ccaf
-- 
2.44.0.278.ge034bb2e1d-goog