Re: [PATCH v1 2/5] userfaultfd: introduce access-likely mode for common operations

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On Mon, Jun 27, 2022 at 11:37:32PM +0000, Nadav Amit wrote:
> [ +Dave Hansen to say how wrong I am ] 
> 
> > On Jun 27, 2022, at 6:12 AM, Peter Xu <peterx@xxxxxxxxxx> wrote:
> > 
> > ⚠ External Email
> > 
> > On Sat, Jun 25, 2022 at 07:49:54AM +0000, Nadav Amit wrote:
> >> 
> >> 
> >>> On Jun 24, 2022, at 3:17 PM, Peter Xu <peterx@xxxxxxxxxx> wrote:
> >>> 
> >>> On Fri, Jun 24, 2022 at 05:58:17PM -0400, Peter Xu wrote:
> >>>> [Sorry for replying late]
> >>>> 
> >>>> Said that, I think it doesn't really necessary need to be that complex,
> >>>> since make_huge_pte() already sets dirty bit when "writable=1", so IIUC
> >>>> what you need to do is simply make sure dirty bit set when write_hint=1.
> >>>> 
> >>>> Does it sounds correct to you?
> >>> 
> >>> Hmm, hold on...  I failed to figure out how that write-likely hint could
> >>> help us for either huge or non-huge pages, since:
> >>> 
> >>> (1) Old code always set dirty, so no perf degrade anyway with/without the
> >>>     hint
> >>> 
> >>> (2) If we want to rework dirty bit (which I'm totally fine with..), then
> >>>     we don't apply it when we shouldn't, and afaict we should set D bit
> >>>     whenever we should...  if the user assumes this page is likely to be
> >>>     written but made it read-only, say, with UFFDIO_COPY(wp_mode=1),
> >>>     setting D bit will not help, instead, the user should simply use an
> >>>     UFFDIO_COPY(wp_mode=0) then the dirty will be set with write=1..
> >>> 
> >>> It'll be helpful but only helpful for UFFDIO_ZEROCOPY because it avoids one
> >>> COW.  But that seems to be it.
> >>> 
> >>> In short: I'm wondering whether we only really need the ACCESS_LIKELY hint
> >>> as you proposed earlier.  We may want UFFDIO_ZEROPAGE_MODE_ALLOCATE
> >>> separately, but keep that only for zeropage op (and it shouldn't really be
> >>> called WRITE_LIKELY)?  Or did I miss something?
> >> 
> >> Let’s see if I get you correctly. I am not sure whether we had this
> >> discussion before.
> >> 
> >> We are talking about a scenario in which WP=0. You argue that if the page
> >> is already set as dirty, what is the benefit of not setting the dirty-bit,
> >> right?
> >> 
> >> So first, IIUC, there are cases in which the page would not be set as
> >> dirty, e.g., UFFDIO_CONTINUE. [ I am admittedly not too familiar with this
> >> use-case, so I say it based on the comments. ]
> >> 
> >> Second, even if the page is dirty (e.g., following UFFDIO_COPY), but it
> >> is not written by the user after UFFDI_COPY, marking the PTE as dirty
> >> when it is mapped would induce overhead, as we discussed before, since
> >> if/when the PTE is unmapped, TLB flush batching might not be possible.
> > 
> > I'd hope we don't make an interface design just to service that purpose of
> > when write=0 and dirty=1 use case that is internal to the kernel so far,
> > and I still think it's the tlb flush code to change.. or do we have other
> > use case for this WRITE_LIKELY hint?
> > 
> > For UFFDIO_CONTINUE, if we want to make things clear on dirty bit, then
> > IMHO for UFFDIO_CONTINUE the right place for the dirty process is where the
> > user writes to the page in the other mapping, where PageDirty() will start
> > to be true already even if the pte that to be CONTINUEd will have dirty=0
> > in the pte entry.  From that pov I still don't see why we need to grant the
> > user on the dirty bit control, no matter with a hint only, or explicit.
> > 
> >> 
> >> So I don’t think there is a problem in having WRITE_LIKELY hint. Moreover,
> >> I would reiterate my position (which you guys convinced me in!)
> > 
> > David convinced you I think :)
> > 
> >> that having hints that indicate what the user does (WRITE_LIKELY) is a
> >> better API than something that indicates directly what the kernel should
> >> do (e.g., UFFDIO_ZEROPAGE_MODE_ALLOCATE).
> > 
> > The hint idea sounds good to me, it's just that we actually have two steps
> > here:
> > 
> >  (1) We think providing user the control of dirty bit makes sense, then,
> >  (2) We think the flag should be a hint not explicit "set dirty bit"
> > 
> > I agree with (2) in this case if (1) is applicable.  And now I think I'm
> > questioning myself on (1).
> > 
> > Fundamentally, access bit has more meaningful context (0 means cold, 1
> > means hot), for dirty it's really more a perf thing to me (when clear,
> > it'll take extra cycles to set it when memory write happens to it; being
> > clear _may_ help only for the tlb flush example you mentioned but I'm not
> > fully convinced that's correct).
> 
> I am not sure we understand each other. I think the benefit of not setting
> a dirty-bit when a page is not actually written is fundamental, and has
> inherit performance benefit.
> 
> When I did x86’s pte_flags_need_flush(), I was defensive, but there is a
> basic optimization that is possible to avoid a TLB flush on non-dirty
> writable PTEs.
> 
> In x86, consider a situation in which you use ptep_modify_prot_start()
> to remove a PTE and load its old value using xchg. (A similar case happens
> on reclaim). Assume you want to write-protect the entry.
> 
> If the PTE is non-dirty then you should be able to avoid a flush, even if
> the PTE is writable. In x86, a write and the change of the dirty-bit are
> performed both atomically. Therefore, if the dirty-bit on the old PTE was
> clear, you can avoid a TLB flush.
> 
> Besides the benefit of avoiding a TLB flush, there is also the benefit
> of having more precise dirty tracking. You assume UFFDIO_CONTINUE will be
> preceded by memory write to the shared memory, but that does not have to
> be the case. Similarly, if in the future userfaultfd would also support
> memory-backed private mappings, that does not have to be the case either.

Could I ask what's the not supported private mapping you're talking about
(and I think you meant "file-backed")?  I thought all kinds of private
mappings are supported on all modes already?

Thanks for explaining, so yes either false positive or false negative on
pte dirty bit can bring a side effect on things, and the user knows the
best.  Makes sense.  I think what I overlooked is the downside of having
both write==dirty==1 when it doesn't need to.

> 
> Putting all of the above aside, there is a bug in my code, but this
> bug also points why dirty should not be set unconditionally. If someone
> uses SOFT_DIRTY with userfaultfd, then marking the PTE as dirty (and
> soft-dirty) might be misleading, causing unnecessary userspace writeback
> of memory.

Well I never thought anyone would be using soft-dirty with uffd because
logically uffd-wp was somehow trying to replace it with a better interface,
hopefully.

If to talk about it, IMHO the most important thing is really not the dirty
bit but the write bit: even if you leave both the dirty bits cleared (so
the user specified !WRITE_HINT then we grant write bit only), it means when
the write happens for sure dirty bit will be set on demand but not
soft-dirty since we won't generate a page fault at all.  AFAICT it'll be a
false negative.

The old code is safe in that respect on that we always set dirty so we can
only have false positive (which is acceptable in this case) but not false
negative (which is not).

> 
> So I do need to fix my code so it would not write-unprotect memory if
> soft-dirty is enabled and UFFD_FLAGS_WRITE_LIKELY is not provided. But
> I think it emphasizes the benefit of having UFFD_FLAGS_WRITE_LIKELY.

Sorry to say that, but to me it seems a proof that dirty bit is even more
tricky than access bit so we need to have better thoughts, even if the
WRITE_LIKELY hint sounds straightforward.  I now can buy in all the tlb
flush points you explained, but still IMHO that's too fine granule (some
random ptes in a batched tlb range flush may not even matter!) and I'm just
naively wondering whether we need more thoughts for an uABI to be proposed
like that.  I won't ask for some use cases and especially numbers to prove
assuming I'm asking too much, but really that'll be very persuasive if we
can get.  I'd not worry for ACCESS_LIKELY on it because that's much more
straightforward to me.

> 
> > 
> > Maybe with the to be proposed RFC patch for tlb flush we can know whether
> > that should be something we can rely on.  It'll add more dependency on this
> > work which I'm sorry to say.  It's just that IMHO we should think carefully
> > for the write-hint because this is a solid new uABI we're talking about.
> > 
> > The other option is we can introduce the access hint first and think more
> > on the dirty one (we can always add it when proper).  What do you think?
> > Also, David please chim in anytime if I missed the whole point when you
> > proposed the idea.
> > 
> >> 
> >> But this discussion made me think that there are two somewhat related
> >> matters that we may want to address:
> >> 
> >> 1. mwriteprotect_range() should use MM_CP_TRY_CHANGE_WRITABLE when !wp
> >> to proactively make entries writable and save .
> > 
> > I'm not sure I'm right here, but I think David's patch should have covered
> > that case?  The new helper only checks pte_uffd_wp() based on my memory,
> > and when resolving page faults uffd-wp bit should have been gone, so it
> > should be treated the same as normal ptes.
> 
> Let’s see we get to the same page:
> 
> mwriteprotect_range() does:
> 
>         change_protection(&tlb, dst_vma, start, start + len, newprot,
>                           enable_wp ? MM_CP_UFFD_WP : MM_CP_UFFD_WP_RESOLVE)
> 
> As you see no use of MM_CP_TRY_CHANGE_WRITABLE.
> 
> And then change_pte_range() does:
> 
>                         if ((cp_flags & MM_CP_TRY_CHANGE_WRITABLE) &&
>                             !pte_write(ptent) &&
>                             can_change_pte_writable(vma, addr, ptent))
>                                 ptent = pte_mkwrite(ptent);
> 
> If we do not provide MM_CP_TRY_CHANGE_WRITABLE, the PTE would not be
> writable.
> 
> Now for the record, we may want an additional optimization, so if a
> fault handler is woken because a PTE become writable, we do not retry
> the page fault (since the PTE is already writable). It is a small change,
> but let’s see first we agree on the patches so far…

Ah I see what I missed, thanks.

Another option is we call can_change_pte_writable() somehow in the
unprotect branch.

-- 
Peter Xu





[Index of Archives]     [Linux ARM Kernel]     [Linux ARM]     [Linux Omap]     [Fedora ARM]     [IETF Annouce]     [Bugtraq]     [Linux OMAP]     [Linux MIPS]     [eCos]     [Asterisk Internet PBX]     [Linux API]

  Powered by Linux