Re: [PATCH v1 2/5] userfaultfd: introduce access-likely mode for common operations

Nadav Amit <nadav.amit@xxxxxxxxx> · Tue, 12 Jul 2022 18:09:35 -0700

On Jul 12, 2022, at 7:56 AM, Peter Xu <peterx@xxxxxxxxxx> wrote:

> Hi, Nadav,
> 
> On Tue, Jul 12, 2022 at 06:19:08AM +0000, Nadav Amit wrote:
>> On Jun 22, 2022, at 11:50 AM, Nadav Amit <nadav.amit@xxxxxxxxx> wrote:
>> 
>>> From: Nadav Amit <namit@xxxxxxxxxx>
>>> 
>>> Using a PTE on x86 with cleared access-bit (aka young-bit)
>>> takes ~600 cycles more than when the access bit is set. At the same
>>> time, setting the access-bit for memory that is not used (e.g.,
>>> prefetched) can introduce greater overheads, as the prefetched memory is
>>> reclaimed later than it should be.
>>> 
>>> Userfaultfd currently does not set the access-bit (excluding the
>>> huge-pages case). Arguably, it is best to let the user control whether
>>> the access bit should be set or not. The expected use is to request
>>> userfaultfd to set the access-bit when the copy/wp operation is done to
>>> resolve a page-fault, and not to set the access-bit when the memory is
>>> prefetched.
>>> 
>>> Introduce UFFDIO_[op]_ACCESS_LIKELY to enable userspace to request the
>>> young bit to be set.
>> 
>> I reply to my own email, but this mostly addresses the concerns that Peter
>> has raised.
>> 
>> So I ran the test below on my Haswell (x86), which showed two things:
>> 
>> 1. Accessing an address using a clean PTE or old PTE takes ~500 cycles
>> more than with dirty+young (depending on the access, of course: dirty
>> does not matter for read, dirty+young both matter for write).
>> 
>> 2. I made a mistake in my implementation. PTEs are - at least on x86 -
>> created as young with mk_pte(). So the logic should be similar to
>> do_set_pte():
>> 
>> if (prefault && arch_wants_old_prefaulted_pte())
>> entry = pte_mkold(entry);
>> else
>> entry = pte_sw_mkyoung(entry);
>> 
>> Based on these results, I will send another version for both young and
>> dirty. Let me know if these results are not convincing.
> 
> Thanks for trying to verify this idea, but I'm not fully sure this is what
> my concern was on WRITE_LIKELY.
> 
> AFAICT the test below was trying to measure the overhead of hardware
> setting either access or dirty or both bits when they're not set for
> read/write.

Indeed.

> 
> What I wanted as a justification is whether WRITE_LIKELY would be helpful
> in any real world scenario at all. AFAIK the only way to prove it so far
> is to measure any tlb flush difference (probably only on x86, since that
> tlb code is only compiled on x86) that may trigger with W=0,D=1 but may not
> trigger with W=0,D=0 (where W stands for "write bit", and D stands for
> "dirty bit").
> 
> It's not about the slowness when D is cleared.
> 
> The core thing is (sorry to rephrase, but just hope we're on the same page)
> we'll set D bit always for all uffd pages so far. Even if we want to
> change that behavior so we skip setting D bit for RO pages (we'll need to
> convert the dirty bit into PageDirty though), we'll still always set D bit
> for writable pages. So we always set D bit as long as possible and we'll
> never suffer from hardware overhead on setting D bit for uffd pages.

Thanks as usual for your clarifications. As you see, I also do my best to be
on the same page with, even if from time to time I fail. I had some recent
communication challenges on lkml, so I hope that you understand that
everything I say is said with full respect, and if I use double-quotes while
arguing with you, it is in good spirit, and I really appreciate your
feedback.

Ok. So there is a lot to digest in what you just said, and I politely
disagree with some of the assertions that you made. You focus on discussing
the issue of whether we set the dirty bit for RO pages, which in my opinion
is so intuitively wrong. But, I think that discussing this issue really
digress us from the benefits of not setting the D-bit when it is unnecessary
for RW pages, which is the main question.

But before we get to it, I want to argue with some of the “facts” that you
present:

1. "D bit always for all uffd pages” - This is true for almost all UFFD
operations, but not true to write-unprotect. The moment we use David’s
MM_CP_TRY_CHANGE_WRITABLE in UFFD, the PTE would be writable but not dirty.
[Yes, the patches that I sent do not deal with that: as I noted before, I
want to send it as part of v2.] Arguably, one of the places that setting the
D-bit matters the most if when you change PTE from RO->RW. And anyhow, as
you can see the API is inconsistent.

2. "we'll still always set D bit for writable pages” - To continue my
previous bullet - why? Why would we? Besides uffd-wrunprotect path, why
would we always set it for MCOPY_CONTINUE? (more to follow on this one)

3. "measure any tlb flush difference … that may trigger with W=0,D=1 but may
not trigger with W=0,D=0” - This is really a boring case, which even if was
underoptimized could have been resolved by its own. This is certainly not
the motivation for the write hints.

So please allow me to go back to the reasons for why you want write-hints,
and RO entries would be mostly left out and implicitly included in the
other arguments - which are mainly about RW entries:

1. You can avoid TLB flushes when write-protecting clean writable PTEs (on
x86). Such PTEs can be created when userspace monitor prefaults or
speculatively write-unprotects memory that might be needed. When a monitor
removes the mapping, using MADV_DONTNEED, it would not need to flush
anything.

2. Hopefully you agree that write-hints are needed if during
uffd-write-unprotect we actually write-unprotect the PTE (not just clearing
the logical flag). So you do need write-hint for zero-page (to get a
clear-page) and for write-unprotect, so why not to be consistent and provide
it for all operations?

3. For UFFDIO_CONTINUE. Admittedly, I am not very familiar with
UFFDIO_CONTINUE, but from the discussion (and skimming the code) I
understood that you can be used for prefetching. In such case, why would you
assume that the page is dirty?

4. It allows you to treat softdirty properly. If softdirty is on,
presumably, if you did not get a WRITE_HINT, you would keep it
writeprotected.

5. Consistency with UFFDIO_ZERO that needs a write-hint to clear a page.

Now I will “kill myself” over support of write-hints. Not everything that
I mentioned here is in v1 that I sent.

But, if you decide you do not want write-hints, I would still need to leave
the UFFDIO_ZERO write-hints (that clear the page) and to find some solution
for UFFDIO_WRITEPROTECT (that should set the dirty-bit for better performance
of WP page-fault handling). 

If you decide you do want it, I would run some tests to check that indeed
the access-time of UFFDIO_WRITEPROTECT reflects the fact no TLB flush was
needed.

> The other worry of having WRITE_HINT is, after we have it we probably need
> to _not_ apply dirty bit when WRITE_HINT is not set (which is actually a
> very light ABI change since we used to always set it), then I'll start to
> worry the hardware setting D bit overhead you just measured because we'll
> have that overhead when user didn't specify WRITE_HINT with the old code.

Good point, which I have already got to. So, because I was mistaken on the
ACCESS_HINT understanding (young-bit is set by default), this would mean
that the hints should only be regarded when the “hints” feature is enabled
for backward compatibility. I got this code ready for v2.

> 
> So again, I'm totally fine if you want to start with ACCESS_HINT only, but
> I still don't see why we should need WRITE_HINT too..

I really hate how the “features” evolve that I think it is better to decide
now.

I think that in the lack of agreement, the best thing to do is to put all of
the write-hints now in the API, and only regard them for UFFDIO_ZERO (which
would clear the page) and UFFDIO_WRITE(un)PROTECT (that would set the old
bit).

Again, thanks for the feedback, and hopefully (at least) I understood you
this time.