Re: [PATCH v1 2/5] userfaultfd: introduce access-likely mode for common operations

Peter Xu <peterx@xxxxxxxxxx> · Wed, 13 Jul 2022 12:02:11 -0400

On Tue, Jul 12, 2022 at 06:09:35PM -0700, Nadav Amit wrote:
> On Jul 12, 2022, at 7:56 AM, Peter Xu <peterx@xxxxxxxxxx> wrote:
> 
> > Hi, Nadav,
> > 
> > On Tue, Jul 12, 2022 at 06:19:08AM +0000, Nadav Amit wrote:
> >> On Jun 22, 2022, at 11:50 AM, Nadav Amit <nadav.amit@xxxxxxxxx> wrote:
> >> 
> >>> From: Nadav Amit <namit@xxxxxxxxxx>
> >>> 
> >>> Using a PTE on x86 with cleared access-bit (aka young-bit)
> >>> takes ~600 cycles more than when the access bit is set. At the same
> >>> time, setting the access-bit for memory that is not used (e.g.,
> >>> prefetched) can introduce greater overheads, as the prefetched memory is
> >>> reclaimed later than it should be.
> >>> 
> >>> Userfaultfd currently does not set the access-bit (excluding the
> >>> huge-pages case). Arguably, it is best to let the user control whether
> >>> the access bit should be set or not. The expected use is to request
> >>> userfaultfd to set the access-bit when the copy/wp operation is done to
> >>> resolve a page-fault, and not to set the access-bit when the memory is
> >>> prefetched.
> >>> 
> >>> Introduce UFFDIO_[op]_ACCESS_LIKELY to enable userspace to request the
> >>> young bit to be set.
> >> 
> >> I reply to my own email, but this mostly addresses the concerns that Peter
> >> has raised.
> >> 
> >> So I ran the test below on my Haswell (x86), which showed two things:
> >> 
> >> 1. Accessing an address using a clean PTE or old PTE takes ~500 cycles
> >> more than with dirty+young (depending on the access, of course: dirty
> >> does not matter for read, dirty+young both matter for write).
> >> 
> >> 2. I made a mistake in my implementation. PTEs are - at least on x86 -
> >> created as young with mk_pte(). So the logic should be similar to
> >> do_set_pte():
> >> 
> >> if (prefault && arch_wants_old_prefaulted_pte())
> >> entry = pte_mkold(entry);
> >> else
> >> entry = pte_sw_mkyoung(entry);
> >> 
> >> Based on these results, I will send another version for both young and
> >> dirty. Let me know if these results are not convincing.
> > 
> > Thanks for trying to verify this idea, but I'm not fully sure this is what
> > my concern was on WRITE_LIKELY.
> > 
> > AFAICT the test below was trying to measure the overhead of hardware
> > setting either access or dirty or both bits when they're not set for
> > read/write.
> 
> Indeed.
> 
> > 
> > What I wanted as a justification is whether WRITE_LIKELY would be helpful
> > in any real world scenario at all. AFAIK the only way to prove it so far
> > is to measure any tlb flush difference (probably only on x86, since that
> > tlb code is only compiled on x86) that may trigger with W=0,D=1 but may not
> > trigger with W=0,D=0 (where W stands for "write bit", and D stands for
> > "dirty bit").
> > 
> > It's not about the slowness when D is cleared.
> > 
> > The core thing is (sorry to rephrase, but just hope we're on the same page)
> > we'll set D bit always for all uffd pages so far. Even if we want to
> > change that behavior so we skip setting D bit for RO pages (we'll need to
> > convert the dirty bit into PageDirty though), we'll still always set D bit
> > for writable pages. So we always set D bit as long as possible and we'll
> > never suffer from hardware overhead on setting D bit for uffd pages.
> 
> Thanks as usual for your clarifications. As you see, I also do my best to be
> on the same page with, even if from time to time I fail. I had some recent
> communication challenges on lkml, so I hope that you understand that
> everything I say is said with full respect, and if I use double-quotes while
> arguing with you, it is in good spirit, and I really appreciate your
> feedback.
> 
> Ok. So there is a lot to digest in what you just said, and I politely
> disagree with some of the assertions that you made. You focus on discussing
> the issue of whether we set the dirty bit for RO pages, which in my opinion
> is so intuitively wrong. But, I think that discussing this issue really
> digress us from the benefits of not setting the D-bit when it is unnecessary
> for RW pages, which is the main question.
> 
> But before we get to it, I want to argue with some of the “facts” that you
> present:
> 
> 1. "D bit always for all uffd pages” - This is true for almost all UFFD
> operations, but not true to write-unprotect. The moment we use David’s
> MM_CP_TRY_CHANGE_WRITABLE in UFFD, the PTE would be writable but not dirty.
> [Yes, the patches that I sent do not deal with that: as I noted before, I
> want to send it as part of v2.] Arguably, one of the places that setting the
> D-bit matters the most if when you change PTE from RO->RW. And anyhow, as
> you can see the API is inconsistent.

Yes.  This point is WRITE_HINT irrelevant, afaict, but I fully agree we
should fix it.  It'll be great if you'll cover that upon David's patch.

[1]

> 
> 2. "we'll still always set D bit for writable pages” - To continue my
> previous bullet - why? Why would we? Besides uffd-wrunprotect path, why
> would we always set it for MCOPY_CONTINUE? (more to follow on this one)

I wanted to mean that we used to always set both D bits when W=1.  IIUC
that applies to not only uffd but any general pgtable accesses.  E.g. in
do_anonymous_page() we'll set both D if W=1.  Same to anywhere I can find
across the kernel.

> 
> 3. "measure any tlb flush difference … that may trigger with W=0,D=1 but may
> not trigger with W=0,D=0” - This is really a boring case, which even if was
> underoptimized could have been resolved by its own. This is certainly not
> the motivation for the write hints.

Hmm?

I don't see what benefit we can get from the new WRITE_HINT besides
avoiding tlb flushes, or do we?  See right below..

> 
> 
> So please allow me to go back to the reasons for why you want write-hints,
> and RO entries would be mostly left out and implicitly included in the
> other arguments - which are mainly about RW entries:
> 
> 1. You can avoid TLB flushes when write-protecting clean writable PTEs (on
> x86). Such PTEs can be created when userspace monitor prefaults or
> speculatively write-unprotects memory that might be needed. When a monitor
> removes the mapping, using MADV_DONTNEED, it would not need to flush
> anything.

Yep, but afaict this is the "avoid tlb flush benefit" we talked above..

> 
> 2. Hopefully you agree that write-hints are needed if during
> uffd-write-unprotect

I think I agree on the unprotect above [1] on fixing D bit set if
unprotected.  If with that fix then I think whether WRITE_HINT would help
or not here for unprotect is the same as whether WRITE_HINT would help in
other cases like UFFDIO_COPY.  So these are two problems to me:

  (1) Whether we should fix change_pte_range() to set D properly when
      unprotect (W=1), again I fully agree.

  (2) Whether WRITE_HINT helps for unprotect is to be discussed here, and
      we're trying to reach a consensus..

> we actually write-unprotect the PTE (not just clearing
> the logical flag). So you do need write-hint for zero-page (to get a
> clear-page) and for write-unprotect, so why not to be consistent and provide
> it for all operations?

I always agree on the zeropage case.  But IMHO that's the only special case
here.

> 
> 3. For UFFDIO_CONTINUE. Admittedly, I am not very familiar with
> UFFDIO_CONTINUE, but from the discussion (and skimming the code) I
> understood that you can be used for prefetching. In such case, why would you
> assume that the page is dirty?

Again, I think it's the same as when we faulting in a normal anonymous
page: when we grant write bit we always assume it's dirty.  Yes it may not
really be written, but don't you think that's a more common question to ask
rather than uffd only?  IOW, why we always set D bit in do_anonymous_page()
even if there's possibility that the page may not be written at all?

Actually I'd say we'll have problem if we don't set dirty, because then we
won't be able to track soft-dirty anymore just like below - we need an
explicit page fault to trap soft-dirty.  With W=1, how do we do that?

> 
> 4. It allows you to treat softdirty properly. If softdirty is on,
> presumably, if you did not get a WRITE_HINT, you would keep it
> writeprotected.

Sorry I don't get why WRITE_HINT helps soft-dirty.  As I mentioned before,
I think it will start to complicate soft-dirty rather than making it
better.  Basically when the hints are enabled in feature bits and when the
user has !WRITE_HINT, we'll need to wr-protect the page for soft-dirty if
it's enabled.  We don't need that extra complexity if we don't apply
WRITE_HINT outside ZEROPAGE.  That's really even more than "hardware
setting D bit overhead" because it'll be a full path page fault just to
trap soft-dirty.

I am not sure we're really on the same page here, but I think I'll just
wait and read your code if it's coming and comment there if I see anything
wrong.  Maybe that'll be more straightforward.

> 
> 5. Consistency with UFFDIO_ZERO that needs a write-hint to clear a page.

This one is actually point 2 above.  Taken.

> 
> Now I will “kill myself” over support of write-hints. Not everything that
> I mentioned here is in v1 that I sent.
> 
> But, if you decide you do not want write-hints, I would still need to leave
> the UFFDIO_ZERO write-hints (that clear the page) and to find some solution
> for UFFDIO_WRITEPROTECT (that should set the dirty-bit for better performance
> of WP page-fault handling). 
> 
> If you decide you do want it, I would run some tests to check that indeed
> the access-time of UFFDIO_WRITEPROTECT reflects the fact no TLB flush was
> needed.
> 
> > The other worry of having WRITE_HINT is, after we have it we probably need
> > to _not_ apply dirty bit when WRITE_HINT is not set (which is actually a
> > very light ABI change since we used to always set it), then I'll start to
> > worry the hardware setting D bit overhead you just measured because we'll
> > have that overhead when user didn't specify WRITE_HINT with the old code.
> 
> Good point, which I have already got to. So, because I was mistaken on the
> ACCESS_HINT understanding (young-bit is set by default), this would mean
> that the hints should only be regarded when the “hints” feature is enabled
> for backward compatibility. I got this code ready for v2.

Sounds good.

> 
> > 
> > So again, I'm totally fine if you want to start with ACCESS_HINT only, but
> > I still don't see why we should need WRITE_HINT too..
> 
> I really hate how the “features” evolve that I think it is better to decide
> now.
> 
> I think that in the lack of agreement, the best thing to do is to put all of
> the write-hints now in the API, and only regard them for UFFDIO_ZERO (which
> would clear the page) and UFFDIO_WRITE(un)PROTECT (that would set the old
> bit).

Yes let's fix unprotect on D bit first, because that's inconsistent.
Please put it as the 1st patch if you plan to post them together, I think
we should have it even earlier if not together with the uffd-hint series.

Then, there's one thing we can do as you said - we can introduce WRITE_HINT
but only apply it to ZEROPAGE only (note: I still don't see why unprotect
is special here, it just has one perf bug to fix).  It can be bound to
ACCESS_HINT with the same feature bit then we think about whether we want
to extend it to other uffd ioctls besides ZEROPAGE.  That's probably a good
split on the differences.

> 
> Again, thanks for the feedback, and hopefully (at least) I understood you
> this time.

What you suggested at the end sounds good to me, thanks for being
persistent.  The only thing uncertain seems to be whether we need
WRITE_HINT for UFFDIO_WRITEPROTECT, all the rest sounds a good plan.

Thanks,

-- 
Peter Xu