Re: [PATCH] mm: Add an explicit smp_wmb() to UFFDIO_CONTINUE

Peter Xu <peterx@xxxxxxxxxx> · Wed, 6 Mar 2024 11:41:26 +0800

On Wed, Mar 06, 2024 at 12:15:10AM +0000, James Houghton wrote:
> Users of UFFDIO_CONTINUE may reasonably assume that a write memory
> barrier is included as part of UFFDIO_CONTINUE. That is, a user may
> (mistakenly) believe that all writes it has done to a page that it is
> now UFFDIO_CONTINUE'ing are guaranteed to be visible to anyone
> subsequently reading the page through the newly mapped virtual memory
> region.
> 
> Include only a single smp_wmb() for each UFFDIO_CONTINUE, as that is all
> that is necessary. While we're at it, optimize the smp_wmb() that is
> already incidentally present for the HugeTLB case.
> 
> Documentation doesn't specify if the kernel does a wmb(), so it's not
> wrong not to include it. But by not including it, we are making is easy
> for a user to have a very hard-to-detect bug. Include it now to be safe.
> 
> A user that decides to update the contents of the page in one thread and
> UFFDIO_CONTINUE that page in another must already take additional steps
> to synchronize properly.
> 
> Signed-off-by: James Houghton <jthoughton@xxxxxxxxxx>
> ---
> 
> I'm not sure if this patch should be merged. I think it's the right
> thing to do, as it is very easy for a user to get this wrong. (I have
> been using UFFDIO_CONTINUE for >2 years and only realized this problem
> recently.) Given that it's not a "bug" strictly speaking, even if this
> patch is a good idea, I'm unsure if it needs to be backported.
> 
> This quirk has existed since minor fault support was added for shmem[1].
> 
> I've tried to see if I can legitimately get a user to read stale data,
> and a few attempts with this test[2] have been unsuccessful.

AFAICT that won't easily reproduce even if the problem existed, as we
contain so many implict memory barriers here and there.  E.g. right at the
entry of ioctl(), mmget_not_zero() already contains a full ordering
constraint:

/**
 * atomic_inc_not_zero() - atomic increment unless zero with full ordering
 * @v: pointer to atomic_t

I was expecting the syscall routine will guarantee an ordering already but
indeed I can't find any.  I also checked up Intel's spec and SYSCALL inst
document only has one paragraph on ordering:

        Instruction ordering. Instructions following a SYSCALL may be
        fetched from memory before earlier instructions complete execution,
        but they will not execute (even speculatively) until all
        instructions prior to the SYSCALL have completed execution (the
        later instructions may execute before data stored by the earlier
        instructions have become globally visible).

I guess it implies a hardware reordering is indeed possible in this case?

> 
> [1]: commit 153132571f02 ("userfaultfd/shmem: support UFFDIO_CONTINUE for shmem")
> [2]: https://gist.github.com/48ca/38d0665b0f1a6319a56507dc73a173f9
> 
>  mm/hugetlb.c     | 15 +++++++++------
>  mm/userfaultfd.c | 18 ++++++++++++++++++
>  2 files changed, 27 insertions(+), 6 deletions(-)
> 
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index bb17e5c22759..533bf6b2d94d 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -6779,12 +6779,15 @@ int hugetlb_mfill_atomic_pte(pte_t *dst_pte,
>  		}
>  	}
>  
> -	/*
> -	 * The memory barrier inside __folio_mark_uptodate makes sure that
> -	 * preceding stores to the page contents become visible before
> -	 * the set_pte_at() write.
> -	 */
> -	__folio_mark_uptodate(folio);
> +	if (!is_continue) {
> +		/*
> +		 * The memory barrier inside __folio_mark_uptodate makes sure
> +		 * that preceding stores to the page contents become visible
> +		 * before the set_pte_at() write.
> +		 */
> +		__folio_mark_uptodate(folio);

Can we move the comment above the "if", explaining both conditions?

> +	} else
> +		WARN_ON_ONCE(!folio_test_uptodate(folio));
>  
>  	/* Add shared, newly allocated pages to the page cache. */
>  	if (vm_shared && !is_continue) {
> diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
> index 503ea77c81aa..d515b640ca48 100644
> --- a/mm/userfaultfd.c
> +++ b/mm/userfaultfd.c
> @@ -531,6 +531,10 @@ static __always_inline ssize_t mfill_atomic_hugetlb(
>  			goto out_unlock;
>  	}
>  
> +	if (uffd_flags_mode_is(flags, MFILL_ATOMIC_CONTINUE))
> +		/* See the comment in mfill_atomic. */
> +		smp_wmb();
> +
>  	while (src_addr < src_start + len) {
>  		BUG_ON(dst_addr >= dst_start + len);
>  
> @@ -743,6 +747,20 @@ static __always_inline ssize_t mfill_atomic(struct userfaultfd_ctx *ctx,
>  	    uffd_flags_mode_is(flags, MFILL_ATOMIC_CONTINUE))
>  		goto out_unlock;
>  
> +	if (uffd_flags_mode_is(flags, MFILL_ATOMIC_CONTINUE))
> +		/*
> +		 * A caller might reasonably assume that UFFDIO_CONTINUE
> +		 * contains a wmb() to ensure that any writes to the
> +		 * about-to-be-mapped page by the thread doing the
> +		 * UFFDIO_CONTINUE are guaranteed to be visible to subsequent
> +		 * reads of the page through the newly mapped address.
> +		 *
> +		 * For MFILL_ATOMIC_COPY, the wmb() is done for each COPYed
> +		 * page. We can do the wmb() now for CONTINUE as the user has
> +		 * already prepared the page contents.
> +		 */
> +		smp_wmb();
> +

Why you did it twice separately?  Can we still share the code?

I'm wildly guessing: I don't worry on an extra wmb() in failure paths, as
that's never a performance concern to make failure slightly slower, IMHO.

Thanks,

>  	while (src_addr < src_start + len) {
>  		pmd_t dst_pmdval;
>  
> 
> base-commit: a7f399ae964e1d2a11d88d863a1d64392678ccaf
> -- 
> 2.44.0.278.ge034bb2e1d-goog
> 

-- 
Peter Xu