Re: [PATCH v3 3/4] mm: FLEXIBLE_THP for improved performance

Yu Zhao <yuzhao@xxxxxxxxxx> · Mon, 17 Jul 2023 14:35:47 -0600

On Mon, Jul 17, 2023 at 1:31 PM Yu Zhao <yuzhao@xxxxxxxxxx> wrote:
>
> On Mon, Jul 17, 2023 at 7:36 AM Ryan Roberts <ryan.roberts@xxxxxxx> wrote:
> >
> > >>>> +static int alloc_anon_folio(struct vm_fault *vmf, struct folio **folio)
> > >>>> +{
> > >>>> +       int i;
> > >>>> +       gfp_t gfp;
> > >>>> +       pte_t *pte;
> > >>>> +       unsigned long addr;
> > >>>> +       struct vm_area_struct *vma = vmf->vma;
> > >>>> +       int prefer = anon_folio_order(vma);
> > >>>> +       int orders[] = {
> > >>>> +               prefer,
> > >>>> +               prefer > PAGE_ALLOC_COSTLY_ORDER ? PAGE_ALLOC_COSTLY_ORDER : 0,
> > >>>> +               0,
> > >>>> +       };
> > >>>> +
> > >>>> +       *folio = NULL;
> > >>>> +
> > >>>> +       if (vmf_orig_pte_uffd_wp(vmf))
> > >>>> +               goto fallback;
> > >>>> +
> > >>>> +       for (i = 0; orders[i]; i++) {
> > >>>> +               addr = ALIGN_DOWN(vmf->address, PAGE_SIZE << orders[i]);
> > >>>> +               if (addr >= vma->vm_start &&
> > >>>> +                   addr + (PAGE_SIZE << orders[i]) <= vma->vm_end)
> > >>>> +                       break;
> > >>>> +       }
> > >>>> +
> > >>>> +       if (!orders[i])
> > >>>> +               goto fallback;
> > >>>> +
> > >>>> +       pte = pte_offset_map(vmf->pmd, vmf->address & PMD_MASK);
> > >>>> +       if (!pte)
> > >>>> +               return -EAGAIN;
> > >>>
> > >>> It would be a bug if this happens. So probably -EINVAL?
> > >>
> > >> Not sure what you mean? Hugh Dickins' series that went into v6.5-rc1 makes it
> > >> possible for pte_offset_map() to fail (if I understood correctly) and we have to
> > >> handle this. The intent is that we will return from the fault without making any
> > >> change, then we will refault and try again.
> > >
> > > Thanks for checking that -- it's very relevant. One detail is that
> > > that series doesn't affect anon. IOW, collapsing PTEs into a PMD can't
> > > happen while we are holding mmap_lock for read here, and therefore,
> > > the race that could cause pte_offset_map() on shmem/file PTEs to fail
> > > doesn't apply here.
> >
> > But Hugh's patches have changed do_anonymous_page() to handle failure from
> > pte_offset_map_lock(). So I was just following that pattern. If this really
> > can't happen, then I'd rather WARN/BUG on it, and simplify alloc_anon_folio()'s
> > prototype to just return a `struct folio *` (and if it's null that means ENOMEM).
> >
> > Hugh, perhaps you can comment?
> >
> > As an aside, it was my understanding from LWN, that we are now using a per-VMA
> > lock so presumably we don't hold mmap_lock for read here? Or perhaps that only
> > applies to file-backed memory?
>
> For anon under mmap_lock for read:
> 1. pte_offset_map[_lock]() fails when a parallel PF changes PMD from
> none to leaf.
> 2. changing PMD from non-leaf to leaf is a bug. See the comments in
> the "else" branch in handle_pte_fault().
>
> So for do_anonymous_page(), there is only one case
> pte_offset_map[_lock]() can fail.

===
> For the code above, this case was
> ruled out by vmf_orig_pte_uffd_wp().

Actually I was wrong about this part.
===

> Checking the return value from pte_offset_map[_lock]() is a good
> practice. What I'm saying is that -EAGAIN would mislead people to
> think, in our case, !pte is legitimate, and hence the suggestion of
> replacing it with -EINVAL.

Yes, -EAGAIN is suitable.

> No BUG_ON() please. As I've previously mentioned, it's against
> Documentation/process/coding-style.rst.
>
> > > +Hugh Dickins for further consultation if you need it.
> > >
> > >>>> +
> > >>>> +       for (; orders[i]; i++) {
> > >>>> +               addr = ALIGN_DOWN(vmf->address, PAGE_SIZE << orders[i]);
> > >>>> +               vmf->pte = pte + pte_index(addr);
> > >>>> +               if (!vmf_pte_range_changed(vmf, 1 << orders[i]))
> > >>>> +                       break;
> > >>>> +       }
> > >>>> +
> > >>>> +       vmf->pte = NULL;
> > >>>> +       pte_unmap(pte);
> > >>>> +
> > >>>> +       gfp = vma_thp_gfp_mask(vma);
> > >>>> +
> > >>>> +       for (; orders[i]; i++) {
> > >>>> +               addr = ALIGN_DOWN(vmf->address, PAGE_SIZE << orders[i]);
> > >>>> +               *folio = vma_alloc_folio(gfp, orders[i], vma, addr, true);
> > >>>> +               if (*folio) {
> > >>>> +                       clear_huge_page(&(*folio)->page, addr, 1 << orders[i]);
> > >>>> +                       return 0;
> > >>>> +               }
> > >>>> +       }
> > >>>> +
> > >>>> +fallback:
> > >>>> +       *folio = vma_alloc_zeroed_movable_folio(vma, vmf->address);
> > >>>> +       return *folio ? 0 : -ENOMEM;
> > >>>> +}
> > >>>> +#else
> > >>>> +static inline int alloc_anon_folio(struct vm_fault *vmf, struct folio **folio)
> > >>>
> > >>> Drop "inline" (it doesn't do anything in .c).
> > >>
> > >> There are 38 instances of inline in memory.c alone, so looks like a well used
> > >> convention, even if the compiler may choose to ignore. Perhaps you can educate
> > >> me; what's the benefit of dropping it?
> > >
> > > I'll let Willy and Andrew educate both of us :)
> > >
> > > +Matthew Wilcox +Andrew Morton please. Thank you.
> > >
> > >>> The rest looks good to me.
> > >>
> > >> Great - just incase it wasn't obvious, I decided not to overwrite vmf->address
> > >> with the aligned version, as you suggested
> > >
> > > Yes, I've noticed. Not overwriting has its own merits for sure.
> > >
> > >> for 2 reasons; 1) address is const
> > >> in the struct, so would have had to change that. 2) there is a uffd path that
> > >> can be taken after the vmf->address fixup would have occured and the path
> > >> consumes that member, so it would have had to be un-fixed-up making it more
> > >> messy than the way I opted for.
> > >>
> > >> Thanks for the quick review as always!