Users of UFFDIO_CONTINUE may reasonably assume that a write memory barrier is included as part of UFFDIO_CONTINUE. That is, a user may (mistakenly) believe that all writes it has done to a page that it is now UFFDIO_CONTINUE'ing are guaranteed to be visible to anyone subsequently reading the page through the newly mapped virtual memory region. Include only a single smp_wmb() for each UFFDIO_CONTINUE, as that is all that is necessary. While we're at it, optimize the smp_wmb() that is already incidentally present for the HugeTLB case. Documentation doesn't specify if the kernel does a wmb(), so it's not wrong not to include it. But by not including it, we are making is easy for a user to have a very hard-to-detect bug. Include it now to be safe. A user that decides to update the contents of the page in one thread and UFFDIO_CONTINUE that page in another must already take additional steps to synchronize properly. Signed-off-by: James Houghton <jthoughton@xxxxxxxxxx> --- I'm not sure if this patch should be merged. I think it's the right thing to do, as it is very easy for a user to get this wrong. (I have been using UFFDIO_CONTINUE for >2 years and only realized this problem recently.) Given that it's not a "bug" strictly speaking, even if this patch is a good idea, I'm unsure if it needs to be backported. This quirk has existed since minor fault support was added for shmem[1]. I've tried to see if I can legitimately get a user to read stale data, and a few attempts with this test[2] have been unsuccessful. [1]: commit 153132571f02 ("userfaultfd/shmem: support UFFDIO_CONTINUE for shmem") [2]: https://gist.github.com/48ca/38d0665b0f1a6319a56507dc73a173f9 mm/hugetlb.c | 15 +++++++++------ mm/userfaultfd.c | 18 ++++++++++++++++++ 2 files changed, 27 insertions(+), 6 deletions(-) diff --git a/mm/hugetlb.c b/mm/hugetlb.c index bb17e5c22759..533bf6b2d94d 100644 --- a/mm/hugetlb.c +++ b/mm/hugetlb.c @@ -6779,12 +6779,15 @@ int hugetlb_mfill_atomic_pte(pte_t *dst_pte, } } - /* - * The memory barrier inside __folio_mark_uptodate makes sure that - * preceding stores to the page contents become visible before - * the set_pte_at() write. - */ - __folio_mark_uptodate(folio); + if (!is_continue) { + /* + * The memory barrier inside __folio_mark_uptodate makes sure + * that preceding stores to the page contents become visible + * before the set_pte_at() write. + */ + __folio_mark_uptodate(folio); + } else + WARN_ON_ONCE(!folio_test_uptodate(folio)); /* Add shared, newly allocated pages to the page cache. */ if (vm_shared && !is_continue) { diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c index 503ea77c81aa..d515b640ca48 100644 --- a/mm/userfaultfd.c +++ b/mm/userfaultfd.c @@ -531,6 +531,10 @@ static __always_inline ssize_t mfill_atomic_hugetlb( goto out_unlock; } + if (uffd_flags_mode_is(flags, MFILL_ATOMIC_CONTINUE)) + /* See the comment in mfill_atomic. */ + smp_wmb(); + while (src_addr < src_start + len) { BUG_ON(dst_addr >= dst_start + len); @@ -743,6 +747,20 @@ static __always_inline ssize_t mfill_atomic(struct userfaultfd_ctx *ctx, uffd_flags_mode_is(flags, MFILL_ATOMIC_CONTINUE)) goto out_unlock; + if (uffd_flags_mode_is(flags, MFILL_ATOMIC_CONTINUE)) + /* + * A caller might reasonably assume that UFFDIO_CONTINUE + * contains a wmb() to ensure that any writes to the + * about-to-be-mapped page by the thread doing the + * UFFDIO_CONTINUE are guaranteed to be visible to subsequent + * reads of the page through the newly mapped address. + * + * For MFILL_ATOMIC_COPY, the wmb() is done for each COPYed + * page. We can do the wmb() now for CONTINUE as the user has + * already prepared the page contents. + */ + smp_wmb(); + while (src_addr < src_start + len) { pmd_t dst_pmdval; base-commit: a7f399ae964e1d2a11d88d863a1d64392678ccaf -- 2.44.0.278.ge034bb2e1d-goog