Re: [PATCH v1 2/5] userfaultfd: introduce access-likely mode for common operations

Nadav Amit <namit@xxxxxxxxxx> · Mon, 27 Jun 2022 23:37:32 +0000

[ +Dave Hansen to say how wrong I am ] 

> On Jun 27, 2022, at 6:12 AM, Peter Xu <peterx@xxxxxxxxxx> wrote:
> 
> ⚠ External Email
> 
> On Sat, Jun 25, 2022 at 07:49:54AM +0000, Nadav Amit wrote:
>> 
>> 
>>> On Jun 24, 2022, at 3:17 PM, Peter Xu <peterx@xxxxxxxxxx> wrote:
>>> 
>>> On Fri, Jun 24, 2022 at 05:58:17PM -0400, Peter Xu wrote:
>>>> [Sorry for replying late]
>>>> 
>>>> Said that, I think it doesn't really necessary need to be that complex,
>>>> since make_huge_pte() already sets dirty bit when "writable=1", so IIUC
>>>> what you need to do is simply make sure dirty bit set when write_hint=1.
>>>> 
>>>> Does it sounds correct to you?
>>> 
>>> Hmm, hold on...  I failed to figure out how that write-likely hint could
>>> help us for either huge or non-huge pages, since:
>>> 
>>> (1) Old code always set dirty, so no perf degrade anyway with/without the
>>>     hint
>>> 
>>> (2) If we want to rework dirty bit (which I'm totally fine with..), then
>>>     we don't apply it when we shouldn't, and afaict we should set D bit
>>>     whenever we should...  if the user assumes this page is likely to be
>>>     written but made it read-only, say, with UFFDIO_COPY(wp_mode=1),
>>>     setting D bit will not help, instead, the user should simply use an
>>>     UFFDIO_COPY(wp_mode=0) then the dirty will be set with write=1..
>>> 
>>> It'll be helpful but only helpful for UFFDIO_ZEROCOPY because it avoids one
>>> COW.  But that seems to be it.
>>> 
>>> In short: I'm wondering whether we only really need the ACCESS_LIKELY hint
>>> as you proposed earlier.  We may want UFFDIO_ZEROPAGE_MODE_ALLOCATE
>>> separately, but keep that only for zeropage op (and it shouldn't really be
>>> called WRITE_LIKELY)?  Or did I miss something?
>> 
>> Let’s see if I get you correctly. I am not sure whether we had this
>> discussion before.
>> 
>> We are talking about a scenario in which WP=0. You argue that if the page
>> is already set as dirty, what is the benefit of not setting the dirty-bit,
>> right?
>> 
>> So first, IIUC, there are cases in which the page would not be set as
>> dirty, e.g., UFFDIO_CONTINUE. [ I am admittedly not too familiar with this
>> use-case, so I say it based on the comments. ]
>> 
>> Second, even if the page is dirty (e.g., following UFFDIO_COPY), but it
>> is not written by the user after UFFDI_COPY, marking the PTE as dirty
>> when it is mapped would induce overhead, as we discussed before, since
>> if/when the PTE is unmapped, TLB flush batching might not be possible.
> 
> I'd hope we don't make an interface design just to service that purpose of
> when write=0 and dirty=1 use case that is internal to the kernel so far,
> and I still think it's the tlb flush code to change.. or do we have other
> use case for this WRITE_LIKELY hint?
> 
> For UFFDIO_CONTINUE, if we want to make things clear on dirty bit, then
> IMHO for UFFDIO_CONTINUE the right place for the dirty process is where the
> user writes to the page in the other mapping, where PageDirty() will start
> to be true already even if the pte that to be CONTINUEd will have dirty=0
> in the pte entry.  From that pov I still don't see why we need to grant the
> user on the dirty bit control, no matter with a hint only, or explicit.
> 
>> 
>> So I don’t think there is a problem in having WRITE_LIKELY hint. Moreover,
>> I would reiterate my position (which you guys convinced me in!)
> 
> David convinced you I think :)
> 
>> that having hints that indicate what the user does (WRITE_LIKELY) is a
>> better API than something that indicates directly what the kernel should
>> do (e.g., UFFDIO_ZEROPAGE_MODE_ALLOCATE).
> 
> The hint idea sounds good to me, it's just that we actually have two steps
> here:
> 
>  (1) We think providing user the control of dirty bit makes sense, then,
>  (2) We think the flag should be a hint not explicit "set dirty bit"
> 
> I agree with (2) in this case if (1) is applicable.  And now I think I'm
> questioning myself on (1).
> 
> Fundamentally, access bit has more meaningful context (0 means cold, 1
> means hot), for dirty it's really more a perf thing to me (when clear,
> it'll take extra cycles to set it when memory write happens to it; being
> clear _may_ help only for the tlb flush example you mentioned but I'm not
> fully convinced that's correct).

I am not sure we understand each other. I think the benefit of not setting
a dirty-bit when a page is not actually written is fundamental, and has
inherit performance benefit.

When I did x86’s pte_flags_need_flush(), I was defensive, but there is a
basic optimization that is possible to avoid a TLB flush on non-dirty
writable PTEs.

In x86, consider a situation in which you use ptep_modify_prot_start()
to remove a PTE and load its old value using xchg. (A similar case happens
on reclaim). Assume you want to write-protect the entry.

If the PTE is non-dirty then you should be able to avoid a flush, even if
the PTE is writable. In x86, a write and the change of the dirty-bit are
performed both atomically. Therefore, if the dirty-bit on the old PTE was
clear, you can avoid a TLB flush.

Besides the benefit of avoiding a TLB flush, there is also the benefit
of having more precise dirty tracking. You assume UFFDIO_CONTINUE will be
preceded by memory write to the shared memory, but that does not have to
be the case. Similarly, if in the future userfaultfd would also support
memory-backed private mappings, that does not have to be the case either.

Putting all of the above aside, there is a bug in my code, but this
bug also points why dirty should not be set unconditionally. If someone
uses SOFT_DIRTY with userfaultfd, then marking the PTE as dirty (and
soft-dirty) might be misleading, causing unnecessary userspace writeback
of memory.

So I do need to fix my code so it would not write-unprotect memory if
soft-dirty is enabled and UFFD_FLAGS_WRITE_LIKELY is not provided. But
I think it emphasizes the benefit of having UFFD_FLAGS_WRITE_LIKELY.

> 
> Maybe with the to be proposed RFC patch for tlb flush we can know whether
> that should be something we can rely on.  It'll add more dependency on this
> work which I'm sorry to say.  It's just that IMHO we should think carefully
> for the write-hint because this is a solid new uABI we're talking about.
> 
> The other option is we can introduce the access hint first and think more
> on the dirty one (we can always add it when proper).  What do you think?
> Also, David please chim in anytime if I missed the whole point when you
> proposed the idea.
> 
>> 
>> But this discussion made me think that there are two somewhat related
>> matters that we may want to address:
>> 
>> 1. mwriteprotect_range() should use MM_CP_TRY_CHANGE_WRITABLE when !wp
>> to proactively make entries writable and save .
> 
> I'm not sure I'm right here, but I think David's patch should have covered
> that case?  The new helper only checks pte_uffd_wp() based on my memory,
> and when resolving page faults uffd-wp bit should have been gone, so it
> should be treated the same as normal ptes.

Let’s see we get to the same page:

mwriteprotect_range() does:

        change_protection(&tlb, dst_vma, start, start + len, newprot,
                          enable_wp ? MM_CP_UFFD_WP : MM_CP_UFFD_WP_RESOLVE)

As you see no use of MM_CP_TRY_CHANGE_WRITABLE.

And then change_pte_range() does:

                        if ((cp_flags & MM_CP_TRY_CHANGE_WRITABLE) &&
                            !pte_write(ptent) &&
                            can_change_pte_writable(vma, addr, ptent))
                                ptent = pte_mkwrite(ptent);

If we do not provide MM_CP_TRY_CHANGE_WRITABLE, the PTE would not be
writable.

Now for the record, we may want an additional optimization, so if a
fault handler is woken because a PTE become writable, we do not retry
the page fault (since the PTE is already writable). It is a small change,
but let’s see first we agree on the patches so far…