Re: [RFC PATCH 07/14] mm/khugepaged: add vm_flags_ignore to hugepage_vma_revalidate_pmd_count()

"Zach O'Keefe" <zokeefe@xxxxxxxxxx> · Thu, 10 Mar 2022 12:24:35 -0800



On Thu, Mar 10, 2022 at 11:54 AM Yang Shi <shy828301@xxxxxxxxx> wrote:
>
> On Thu, Mar 10, 2022 at 10:46 AM David Rientjes <rientjes@xxxxxxxxxx> wrote:
> >
> > On Thu, 10 Mar 2022, Yang Shi wrote:
> >
> > > > This separates "async-hint" vs "sync-explicit" madvise requests.
> > > > MADV_[NO]HUGEPAGE are hints, and together with thp settings, advise
> > > > the kernel how to treat memory in the future. The kernel uses
> > > > VM_[NO]HUGEPAGE to aid with this. MADV_COLLAPSE, as an explicit
> > > > request, is free to define its own defrag semantics.
> > > >
> > > > This would allow flexibility to separately define async vs sync thp
> > > > policies. For example, highly tuned userspace applications that are
> > > > sensitive to unexpected latency might want to manage their hugepages
> > > > utilization themselves, and ask khugepaged to stay away. There is no
> > > > way in "always" mode to do this without setting VM_NOHUGEPAGE.
> > >
> > > > I don't quite get why you set THP to always but don't want
> > > > khugepaged to do its job. It may be slow, I think this is why you
> > > > introduce MADV_COLLAPSE, right? But it doesn't mean khugepaged can't
> > > > scan the same area, it just doesn't do any real work and wastes some
> > > cpu cycles. But I guess MADV_COLLAPSE doesn't prevent the PMD/THP from
> > > being split, right? So khugepaged still plays a role to re-collapse
> > > the area without calling MADV_COLLAPSE over again and again.
> > >
> >
> > My only real concern for MADV_COLLAPSE was when the span being collapsed
> > includes a mixture of both VM_HUGEPAGE and VM_NOHUGEPAGE.  Does this
> > collapse over the eligible memory or does it fail entirely?
> >
> > I'd think it was the former, that we should respect VM_NOHUGEPAGE and only
> > collapse eligible memory when doing MADV_COLLAPSE but now userspace
> > struggles to know whether it was a partial collapse because of
> > > ineligibility or because we just couldn't allocate a hugepage.
>
> Yes, I agree we should just try to collapse eligible vmas.
>
> Since we are using madvise, we'd better follow its convention. We
> could return different values for different failures, for example:
> 1. All vmas are collapsed successfully, return 0 (success)
> 2. Run into ineligible vma, return -EINVAL
> 3. Can't allocate hugepage, return -ENOMEM
>
> Or just simply return 0 for success or a single error code for all
> failure cases.
>

Different codes have a benefit, assuming -EINVAL takes precedence over
-EAGAIN (AFAIK the madvise convention for memory not being available): a
lazy user wouldn't need to read smaps on -EAGAIN; they could just reissue
the syscall over the same range at a later time.
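
To illustrate, here is a rough userspace sketch of that "lazy user" loop. It
assumes the error semantics discussed above (EINVAL for an ineligible vma,
EAGAIN for memory not available), none of which is settled. The collapse_fn
type, the collapse_result enum, and the fake_* stubs are hypothetical names
made up for this example; a real caller would pass a thin wrapper around
madvise() with the proposed MADV_COLLAPSE advice.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical wrapper type: issues the collapse request over
 * [addr, addr + len) and, like madvise(2), returns 0 on success or
 * -1 with errno set on failure. */
typedef int (*collapse_fn)(void *addr, size_t len);

enum collapse_result {
	COLLAPSE_OK,              /* whole range collapsed */
	COLLAPSE_INELIGIBLE,      /* EINVAL: retrying the same range won't help */
	COLLAPSE_RETRY_EXHAUSTED, /* still EAGAIN after max_attempts tries */
	COLLAPSE_OTHER            /* any other errno */
};

/* Reissue the request over the same range while it fails with EAGAIN. */
static enum collapse_result collapse_with_retry(collapse_fn collapse,
						void *addr, size_t len,
						int max_attempts)
{
	assert(collapse != NULL);

	for (int attempt = 0; attempt < max_attempts; attempt++) {
		if (collapse(addr, len) == 0)
			return COLLAPSE_OK;
		if (errno == EINVAL)
			return COLLAPSE_INELIGIBLE;
		if (errno != EAGAIN)
			return COLLAPSE_OTHER;
		/* EAGAIN: a real caller would sleep or back off here. */
	}
	return COLLAPSE_RETRY_EXHAUSTED;
}

/* Stand-in stubs so the loop can be exercised without kernel support. */
static int fake_calls;

/* Fails with EAGAIN on the first two calls, then succeeds. */
static int fake_collapse_busy_twice(void *addr, size_t len)
{
	(void)addr;
	(void)len;
	if (++fake_calls <= 2) {
		errno = EAGAIN;
		return -1;
	}
	return 0;
}

/* Always fails as if the range contained an ineligible vma. */
static int fake_collapse_ineligible(void *addr, size_t len)
{
	(void)addr;
	(void)len;
	errno = EINVAL;
	return -1;
}
```

The point is only that the caller can decide from errno alone whether to
retry, without parsing smaps. If every failure returned the same code, the
EINVAL and EAGAIN branches would collapse into one and the caller would have
to inspect the range itself.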

> >
> > It has the information to figure this out on its own, so given the use of
> > VM_NOHUGEPAGE for non-MADV_NOHUGEPAGE purposes, I think it makes sense to
> > simply ignore these vmas as part of the collapse request.