On Tue, Mar 5, 2024 at 7:41 PM Peter Xu <peterx@xxxxxxxxxx> wrote:
>
> On Wed, Mar 06, 2024 at 12:15:10AM +0000, James Houghton wrote:
> > I've tried to see if I can legitimately get a user to read stale data,
> > and a few attempts with this test[2] have been unsuccessful.
>
> AFAICT that won't easily reproduce even if the problem existed, as we
> have so many implicit memory barriers here and there. E.g. right at the
> entry of ioctl(), mmget_not_zero() already contains a full ordering
> constraint:
>
> /**
>  * atomic_inc_not_zero() - atomic increment unless zero with full ordering
>  * @v: pointer to atomic_t

Oh yes, of course. Thanks for pointing this out. So we definitely
don't need a Fixes.

>
> I was expecting the syscall routine would guarantee an ordering already,
> but indeed I can't find any. I also checked Intel's spec, and the SYSCALL
> instruction documentation only has one paragraph on ordering:
>
>         Instruction ordering. Instructions following a SYSCALL may be
>         fetched from memory before earlier instructions complete execution,
>         but they will not execute (even speculatively) until all
>         instructions prior to the SYSCALL have completed execution (the
>         later instructions may execute before data stored by the earlier
>         instructions have become globally visible).
>
> I guess it implies a hardware reordering is indeed possible in this case?

That's my understanding as well.

> >
> > [1]: commit 153132571f02 ("userfaultfd/shmem: support UFFDIO_CONTINUE for shmem")
> > [2]: https://gist.github.com/48ca/38d0665b0f1a6319a56507dc73a173f9
> >
> >  mm/hugetlb.c     | 15 +++++++++------
> >  mm/userfaultfd.c | 18 ++++++++++++++++++
> >  2 files changed, 27 insertions(+), 6 deletions(-)
> >
> > diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> > index bb17e5c22759..533bf6b2d94d 100644
> > --- a/mm/hugetlb.c
> > +++ b/mm/hugetlb.c
> > @@ -6779,12 +6779,15 @@ int hugetlb_mfill_atomic_pte(pte_t *dst_pte,
> >                 }
> >         }
> >
> > -       /*
> > -        * The memory barrier inside __folio_mark_uptodate makes sure that
> > -        * preceding stores to the page contents become visible before
> > -        * the set_pte_at() write.
> > -        */
> > -       __folio_mark_uptodate(folio);
> > +       if (!is_continue) {
> > +               /*
> > +                * The memory barrier inside __folio_mark_uptodate makes sure
> > +                * that preceding stores to the page contents become visible
> > +                * before the set_pte_at() write.
> > +                */
> > +               __folio_mark_uptodate(folio);
>
> Can we move the comment above the "if", explaining both conditions?

Yes, I'll do that. I think the explanation for
WARN_ON_ONCE(!folio_test_uptodate(folio)) is: we only need to
__folio_mark_uptodate(folio) if we have allocated a new folio, and
HugeTLB pages will always be Uptodate if they are in the pagecache.
We could even drop the WARN_ON_ONCE.
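Maybe something like this (an untested sketch; I'll refine the comment
wording when I respin -- all the identifiers are the ones already in
the hunk above):

	/*
	 * If we allocated a new folio, the memory barrier inside
	 * __folio_mark_uptodate() makes sure that preceding stores to
	 * the page contents become visible before the set_pte_at()
	 * write. For CONTINUE, the folio was found in the page cache,
	 * and HugeTLB folios in the page cache are always uptodate, so
	 * no barrier is needed here.
	 */
	if (!is_continue)
		__folio_mark_uptodate(folio);
	else
		WARN_ON_ONCE(!folio_test_uptodate(folio));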
> > +       } else
> > +               WARN_ON_ONCE(!folio_test_uptodate(folio));
> >
> >         /* Add shared, newly allocated pages to the page cache. */
> >         if (vm_shared && !is_continue) {
> > diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
> > index 503ea77c81aa..d515b640ca48 100644
> > --- a/mm/userfaultfd.c
> > +++ b/mm/userfaultfd.c
> > @@ -531,6 +531,10 @@ static __always_inline ssize_t mfill_atomic_hugetlb(
> >                 goto out_unlock;
> >         }
> >
> > +       if (uffd_flags_mode_is(flags, MFILL_ATOMIC_CONTINUE))
> > +               /* See the comment in mfill_atomic. */
> > +               smp_wmb();
> > +
> >         while (src_addr < src_start + len) {
> >                 BUG_ON(dst_addr >= dst_start + len);
> >
> > @@ -743,6 +747,20 @@ static __always_inline ssize_t mfill_atomic(struct userfaultfd_ctx *ctx,
> >             uffd_flags_mode_is(flags, MFILL_ATOMIC_CONTINUE))
> >                 goto out_unlock;
> >
> > +       if (uffd_flags_mode_is(flags, MFILL_ATOMIC_CONTINUE))
> > +               /*
> > +                * A caller might reasonably assume that UFFDIO_CONTINUE
> > +                * contains a wmb() to ensure that any writes to the
> > +                * about-to-be-mapped page by the thread doing the
> > +                * UFFDIO_CONTINUE are guaranteed to be visible to subsequent
> > +                * reads of the page through the newly mapped address.
> > +                *
> > +                * For MFILL_ATOMIC_COPY, the wmb() is done for each COPYed
> > +                * page. We can do the wmb() now for CONTINUE as the user has
> > +                * already prepared the page contents.
> > +                */
> > +               smp_wmb();
> > +
>
> Why did you do it twice separately? Can we still share the code?
>
> I'm wildly guessing: I don't worry about an extra wmb() in failure paths,
> as it's never a performance concern to make a failure slightly slower, IMHO.

You're right, I shouldn't care so much about making the slow paths a
little bit slower. I'll move the wmb earlier so that we only need it in
one place; see the sketch at the end of this mail.

>
> Thanks,

Thanks Peter!
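P.S. To be concrete, here is roughly what I have in mind for the next
version (an untested sketch, so the final placement inside mfill_atomic()
may still shift, and I've elided the arguments to mfill_atomic_hugetlb()):
do the barrier once in mfill_atomic(), before dispatching to the hugetlb
path, so that mfill_atomic_hugetlb() no longer needs its own copy:

	/*
	 * A caller might reasonably assume that UFFDIO_CONTINUE contains
	 * an smp_wmb() ordering the caller's writes to the about-to-be-
	 * mapped page before subsequent reads through the newly mapped
	 * address. Doing it once here covers both the hugetlb path and
	 * the shmem path.
	 */
	if (uffd_flags_mode_is(flags, MFILL_ATOMIC_CONTINUE))
		smp_wmb();

	/* If this is a HUGETLB vma, pass off to the appropriate routine. */
	if (is_vm_hugetlb_page(dst_vma))
		return mfill_atomic_hugetlb(/* ... args elided ... */);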