Re: [PATCH v2 2/5] userfaultfd: introduce access-likely mode for common operations

Peter Xu <peterx@xxxxxxxxxx> · Mon, 18 Jul 2022 17:21:47 -0400

On Mon, Jul 18, 2022 at 08:59:37PM +0000, Nadav Amit wrote:
> On Jul 18, 2022, at 1:05 PM, Peter Xu <peterx@xxxxxxxxxx> wrote:
> 
> > On Mon, Jul 18, 2022 at 04:47:45AM -0700, Nadav Amit wrote:
> >> @@ -261,6 +272,7 @@ struct uffdio_copy {
> >> struct uffdio_zeropage {
> >>      struct uffdio_range range;
> >> #define UFFDIO_ZEROPAGE_MODE_DONTWAKE                ((__u64)1<<0)
> >> +#define UFFDIO_ZEROPAGE_MODE_ACCESS_LIKELY   ((__u64)1<<1)
> > 
> > Would the access hint help the zeropage use case?  I remember you commented
> > earlier that it won't help since we won't reclaim the zero page anyway.
> 
> I agree that there is no meaning for access bit on zero page. I just think
> that it is best to have the flags for consistency. If you ask me, I would
> prefer to have all the flags in a fixed place (highest bits?). Anyhow, if we
> expose the hints as a feature, I do not think we would later want to say
> “here is another feature that enables another hint that we thought is not
> needed before”. Userfaultfd’s feature bits are already nuts, IMHO.
> 
> > It won't help either even if this flag is only used for the follow up
> > WRITE_HINT (since then there'll be a CoW) because when WRITE_HINT attached
> > it doesn't make sense to not have ACCESS_HINT, then it seems the WRITE_HINT
> > itself would be enough for ZEROPAGE to me.
> 
> Agreed. Again, I think it is worthwhile for consistency.

I'd be fine if it's kernel internal flags only.  But this is solid kernel
ABI.  Are you.. sure?

We're literally trying to introduce some flags just for "consistency" even
if we know nobody will be using them.  That really doesn't sound right for
designing good interfaces..

> 
> > [...]
> > 
> >> diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
> >> index 421784d26651..c15679f3eb6a 100644
> >> --- a/mm/userfaultfd.c
> >> +++ b/mm/userfaultfd.c
> >> @@ -65,6 +65,7 @@ int mfill_atomic_install_pte(struct mm_struct *dst_mm, pmd_t *dst_pmd,
> >>      bool writable = dst_vma->vm_flags & VM_WRITE;
> >>      bool vm_shared = dst_vma->vm_flags & VM_SHARED;
> >>      bool page_in_cache = page->mapping;
> >> +     bool prefault = !(uffd_flags & UFFD_FLAGS_ACCESS_LIKELY);
> > 
> > I think it's okay to name it "prefault" as a temp var, but ideally IMHO we
> > shouldn't assume what the user app is doing - it is only installing some
> > uffd pgtables with !ACCESS_LIKELY and it does not necessarily need to be a
> > prefault process..
> > 
> >>      spinlock_t *ptl;
> >>      struct inode *inode;
> >>      pgoff_t offset, max_off;
> >> @@ -92,6 +93,11 @@ int mfill_atomic_install_pte(struct mm_struct *dst_mm, pmd_t *dst_pmd,
> >>               */
> >>              _dst_pte = pte_wrprotect(_dst_pte);
> >> 
> >> +     if (prefault && arch_wants_old_prefaulted_pte())
> >> +             _dst_pte = pte_mkold(_dst_pte);
> >> +     else
> >> +             _dst_pte = pte_sw_mkyoung(_dst_pte);
> > 
> > Could you explain why we couldn't unconditionally mkold here even for x86?
> 
> To answer this question and the previous one, please note that the logic is
> “borrowed” from do_set_pte(). If you want me to refactor and extract a
> function, please let me know.
> 
> Here is the deal: for x86, we don’t do pte_mkold() because setting the
> access bit is expensive (>500 cycles). For arm64 CPUs that have a hardware
> access bit we do, since (according to the arm64 code or commit log) the cost
> of setting the access bit there is low.
> 
> > It'll be a pity if this feature bit will only be useful on arm64 but not
> > covering x86 (which is so far still the majority I think).
> > 
> > IMHO it's slightly different here compared to kernel prefaults - the user
> > app may not be aware of kernel prefaults, but here !ACCESS_HINT is
> > user-aware, and it's what the user app explicitly provided.  IMO it's
> > already a stronger proof of a cold page.
> 
> I’m ok with that if that is your choice. I actually prefer to give userspace
> more control, but I tried to be consistent with other parts of the kernel.

Ah good to know, then if there's a vote I'll go for your proposal.

I'd suggest we make it a strong semantics.  We used to have similar
discussions around the MADV_COLLAPSE on whether it should be restricted to
khugepaged limitations.  I think it's similar here.
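
To spell out the difference between the two options, here is a toy sketch.
It is not the patch itself: the helper names are made up for illustration,
and "arch_wants_old" just stands in for arch_wants_old_prefaulted_pte().

```c
#include <stdbool.h>

/*
 * Hypothetical helpers, for illustration only.
 *
 * v2 behaviour: install the PTE as old only when the hint is absent
 * AND the architecture opts in (today that means arm64 with hardware AF).
 */
static inline bool toy_install_old_v2(bool access_likely, bool arch_wants_old)
{
	return !access_likely && arch_wants_old;
}

/* Strong semantics: the userspace hint alone decides, on every arch. */
static inline bool toy_install_old_strong(bool access_likely)
{
	return !access_likely;
}
```

With the strong variant, x86 would also get an old PTE whenever userspace
omits ACCESS_LIKELY, which is the case I care about above.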

> Having said that, it’s really hard for me to see why young bit would be clear,
> but dirty bit would be set...

Assume one page has both young/dirty set, the reclaim code decides to age
this page, then.. young=0 && dirty=1?
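
A toy userspace model of that sequence, in case it helps.  The bit positions
and the toy_* names below are made up and do not match any real PTE layout;
the only point is that aging clears young and leaves dirty alone.

```c
#include <stdbool.h>
#include <stdint.h>

/* Toy model only: illustrative bit positions, not a real arch layout. */
#define TOY_PTE_YOUNG	(UINT64_C(1) << 0)
#define TOY_PTE_DIRTY	(UINT64_C(1) << 1)

typedef uint64_t toy_pte_t;

static inline bool toy_pte_young(toy_pte_t pte)
{
	return !!(pte & TOY_PTE_YOUNG);
}

static inline bool toy_pte_dirty(toy_pte_t pte)
{
	return !!(pte & TOY_PTE_DIRTY);
}

/*
 * Aging step in the spirit of what reclaim does: report whether the
 * page was young, then clear young.  The dirty bit is not touched.
 */
static inline bool toy_test_and_clear_young(toy_pte_t *pte)
{
	bool was_young = toy_pte_young(*pte);

	*pte &= ~TOY_PTE_YOUNG;
	return was_young;
}
```

Start from TOY_PTE_YOUNG | TOY_PTE_DIRTY, call toy_test_and_clear_young()
once, and the result has young clear with dirty still set.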

> 
> > The other thing I got confused here is arch_wants_old_prefaulted_pte()
> > returns true if arm64 supports hardware AF.  However for all the rest archs
> > (including x86_64 which, afaict, support AF too in most models) it'll
> > constantly return false.  Do you know what the rationale behind that is?
> 
> All x86 (32/64) since 386 support access-bit in the page-tables (IIRC, 286
> had access bit in the segments).
> 
> I thought we discussed it before: if you access an old PTE on x86, you pay
> >500 cycles; this actually affected UnixBench when people tried to change
> this behavior [1]. In contrast, on arm64, which I have never profiled, you
> probably saw the comment saying: "Experimentally, it's cheap to set the
> access flag in hardware and we benefit from prefaulting mappings as 'old' to
> start with.".

Thanks.  I'm really curious how fast aarch64 is at setting the
hardware-assisted young bit, and why now.

> 
> I do not know what happens on other architectures.
> 
> ( sorry if I have some repetitions in this email )
> 
> [1] https://marc.info/?l=linux-kernel&m=146582237922378&w=2
> 

-- 
Peter Xu