Re: [PATCH v1 2/5] userfaultfd: introduce access-likely mode for common operations

Nadav Amit <namit@xxxxxxxxxx> · Tue, 28 Jun 2022 20:30:46 +0000

> On Jun 28, 2022, at 12:15 PM, Peter Xu <peterx@xxxxxxxxxx> wrote:
> 
> ⚠ External Email
> 
> On Mon, Jun 27, 2022 at 11:37:32PM +0000, Nadav Amit wrote:
>> 
>> Besides the benefit of avoiding a TLB flush, there is also the benefit
>> of having more precise dirty tracking. You assume UFFDIO_CONTINUE will be
>> preceded by memory write to the shared memory, but that does not have to
>> be the case. Similarly, if in the future userfaultfd would also support
>> memory-backed private mappings, that does not have to be the case either.
> 
> Could I ask what's the not supported private mapping you're talking about
> (and I think you meant "file-backed")? I thought all kinds of private
> mappings are supported on all modes already?

Yes, I meant file-backed. See vma_can_userfault() for the supported
types of memory:

  static inline bool vma_can_userfault(struct vm_area_struct *vma,
                                     unsigned long vm_flags)
  {       
        if (vm_flags & VM_UFFD_MINOR)
                return is_vm_hugetlb_page(vma) || vma_is_shmem(vma);

#ifndef CONFIG_PTE_MARKER_UFFD_WP
        /*
         * If user requested uffd-wp but not enabled pte markers for
         * uffd-wp, then shmem & hugetlbfs are not supported but only
         * anonymous.
         */             
        if ((vm_flags & VM_UFFD_WP) && !vma_is_anonymous(vma))
                return false;
#endif
        return vma_is_anonymous(vma) || is_vm_hugetlb_page(vma) ||
            vma_is_shmem(vma);
  }

> 
> Thanks for explaining, so yes either false positive or false negative on
> pte dirty bit can bring a side effect on things, and the user knows the
> best. Makes sense. I think what I overlooked is the downside of having
> both write==dirty==1 when it doesn't need to.
> 
>> 
>> Putting all of the above aside, there is a bug in my code, but this
>> bug also points why dirty should not be set unconditionally. If someone
>> uses SOFT_DIRTY with userfaultfd, then marking the PTE as dirty (and
>> soft-dirty) might be misleading, causing unnecessary userspace writeback
>> of memory.
> 
> Well I never thought anyone would be using soft-dirty with uffd because
> logically uffd-wp was somehow trying to replace it with a better interface,
> hopefully.

I have heard about some who does (not me). So I do not make it up,
unfortunately.

> 
> If to talk about it, IMHO the most important thing is really not the dirty
> bit but the write bit: even if you leave both the dirty bits cleared (so
> the user specified !WRITE_HINT then we grant write bit only), it means when
> the write happens for sure dirty bit will be set on demand but not
> soft-dirty since we won't generate a page fault at all. AFAICT it'll be a
> false negative.
> 
> The old code is safe in that respect on that we always set dirty so we can
> only have false positive (which is acceptable in this case) but not false
> negative (which is not).

I agree that the PTE should be left RO in this case. I was just pointing
that the user might not be aware of soft-dirty being used, and then it
makes sense to treat the (lack of) write-hint appropriately.

In the current code, when PTEs are unconditionally marked as dirty, even
when they are write-protected, soft-dirty would produce false-positives.
Nothing would break, but it is not good for performance either.

> 
>> 
>> So I do need to fix my code so it would not write-unprotect memory if
>> soft-dirty is enabled and UFFD_FLAGS_WRITE_LIKELY is not provided. But
>> I think it emphasizes the benefit of having UFFD_FLAGS_WRITE_LIKELY.
> 
> Sorry to say that, but to me it seems a proof that dirty bit is even more
> tricky than access bit so we need to have better thoughts, even if the
> WRITE_LIKELY hint sounds straightforward. I now can buy in all the tlb
> flush points you explained, but still IMHO that's too fine granule (some
> random ptes in a batched tlb range flush may not even matter!) and I'm just
> naively wondering whether we need more thoughts for an uABI to be proposed
> like that. I won't ask for some use cases and especially numbers to prove
> assuming I'm asking too much, but really that'll be very persuasive if we
> can get. I'd not worry for ACCESS_LIKELY on it because that's much more
> straightforward to me.

Let me get back to you with numbers.

> 
>> 
>>> 
>>> Maybe with the to be proposed RFC patch for tlb flush we can know whether
>>> that should be something we can rely on. It'll add more dependency on this
>>> work which I'm sorry to say. It's just that IMHO we should think carefully
>>> for the write-hint because this is a solid new uABI we're talking about.
>>> 
>>> The other option is we can introduce the access hint first and think more
>>> on the dirty one (we can always add it when proper). What do you think?
>>> Also, David please chim in anytime if I missed the whole point when you
>>> proposed the idea.
>>> 
>>>> 
>>>> But this discussion made me think that there are two somewhat related
>>>> matters that we may want to address:
>>>> 
>>>> 1. mwriteprotect_range() should use MM_CP_TRY_CHANGE_WRITABLE when !wp
>>>> to proactively make entries writable and save .
>>> 
>>> I'm not sure I'm right here, but I think David's patch should have covered
>>> that case? The new helper only checks pte_uffd_wp() based on my memory,
>>> and when resolving page faults uffd-wp bit should have been gone, so it
>>> should be treated the same as normal ptes.
>> 
>> Let’s see we get to the same page:
>> 
>> mwriteprotect_range() does:
>> 
>> change_protection(&tlb, dst_vma, start, start + len, newprot,
>> enable_wp ? MM_CP_UFFD_WP : MM_CP_UFFD_WP_RESOLVE)
>> 
>> As you see no use of MM_CP_TRY_CHANGE_WRITABLE.
>> 
>> And then change_pte_range() does:
>> 
>> if ((cp_flags & MM_CP_TRY_CHANGE_WRITABLE) &&
>> !pte_write(ptent) &&
>> can_change_pte_writable(vma, addr, ptent))
>> ptent = pte_mkwrite(ptent);
>> 
>> If we do not provide MM_CP_TRY_CHANGE_WRITABLE, the PTE would not be
>> writable.
>> 
>> Now for the record, we may want an additional optimization, so if a
>> fault handler is woken because a PTE become writable, we do not retry
>> the page fault (since the PTE is already writable). It is a small change,
>> but let’s see first we agree on the patches so far…
> 
> Ah I see what I missed, thanks.
> 
> Another option is we call can_change_pte_writable() somehow in the
> unprotect branch.

I don’t think that I got this one.

tl;dr: I’ll run some numbers (access when PTE-dirty/clear), address the
aforementioned issues and send v2. Based on the results we can decide
whether it makes sense.