Re: [RFC PATCH 00/14] mm: userspace hugepage collapse

Michal Hocko <mhocko@xxxxxxxx> · Tue, 29 Mar 2022 14:24:56 +0200

On Tue 22-03-22 08:53:35, Zach O'Keefe wrote:
> On Tue, Mar 22, 2022 at 5:11 AM Michal Hocko <mhocko@xxxxxxxx> wrote:
> >
> > On Mon 21-03-22 08:46:35, Zach O'Keefe wrote:
> > > Hey Michal, thanks for taking the time to review / comment.
> > >
> > > On Mon, Mar 21, 2022 at 7:38 AM Michal Hocko <mhocko@xxxxxxxx> wrote:
> > > >
> > > > [ Removed Richard Henderson from the CC list as the delivery fails for
> > > >   his address ]
> > >
> > > Thank you :)
> > >
> > > > On Tue 08-03-22 13:34:03, Zach O'Keefe wrote:
> > > > > Introduction
> > > > > --------------------------------
> > > > >
> > > > > This series provides a mechanism for userspace to induce a collapse of
> > > > > eligible ranges of memory into transparent hugepages in process context,
> > > > > thus permitting users to more tightly control their own hugepage
> > > > > utilization policy at their own expense.
> > > > >
> > > > > This idea was previously introduced by David Rientjes, and thanks to
> > > > > everyone for your patience while I prepared these patches resulting from
> > > > > that discussion[1].
> > > > >
> > > > > [1] https://lore.kernel.org/all/C8C89F13-3F04-456B-BA76-DE2C378D30BF@xxxxxxxxxx/
> > > > >
> > > > > Interface
> > > > > --------------------------------
> > > > >
> > > > > The proposed interface adds a new madvise(2) mode, MADV_COLLAPSE, and
> > > > > leverages the new process_madvise(2) call.
> > > > >
> > > > > (*) process_madvise(2)
> > > > >
> > > > >         Performs a synchronous collapse of the native pages mapped by
> > > > >         the list of iovecs into transparent hugepages. The default gfp
> > > > >         flags used will be the same as those used at-fault for the VMA
> > > > >         region(s) covered.
> > > >
> > > > Could you expand on the reasoning here? The default allocation mode
> > > > for #PF is rather light. Madvised areas will try harder. The reasoning
> > > > is that we want to make stalls due to #PF as small as possible and only
> > > > try harder for madvised areas (also subject to configuration). Wouldn't
> > > > it make more sense to try harder for an explicit call like madvise?
> > > >
> > >
> > > The reasoning is that the user has presumably configured system/vmas
> > > to tell the kernel how badly they want thps, and so this call aligns
> > > with current expectations. I.e. a user who goes to the trouble of
> > > trying to fault-in a thp at a given memory address likely wants a thp
> > > "as badly" as the same user MADV_COLLAPSE'ing the same memory to get a
> > > thp.
> >
> > If the syscall tries only as hard as the #PF, doesn't that limit the
> > functionality?
> 
> I'd argue that the various allocation semantics possible through
> existing thp knobs / vma flags, in addition to the proposed
> MADV_F_COLLAPSE_DEFRAG flag, provide a flexible functional space to
> work with. Relatively speaking, in what way would we be lacking
> functionality?

Flexibility is definitely a plus but look at our existing configuration
space and try to wrap your head around that.

> > I mean a non-#PF path can consume more resources to allocate
> > and collapse a THP as it won't inflict any measurable latency on the
> > targeted process (except for potential CPU contention).
> 
> Sorry, I'm not sure I understand this. What latency are we discussing
> in this point? Do you mean to say that since MADV_COLLAPSE isn't in
> the fault path, it doesn't necessarily need to be fast / direct
> reclaim wouldn't be noticed?

Exactly. Same as khugepaged. I would even argue that khugepaged and
madvise had better behave consistently because in both cases it is a
remote operation to create THPs. One is triggered automatically, the
other is explicitly requested by userspace. Having a third mode (for
madvise) would add more to the configuration space and thus more
complexity.
[...]
> > > Do you mean MADV_F_COLLAPSE_DEFRAG specifically, or both?
> > >
> > > * MADV_F_COLLAPSE_LIMITS is included because we'd like some form of
> > > inter-process protection for collapsing memory in another process'
> > > address space (which a malevolent program could exploit to cause oom
> > > conditions in another memcg hierarchy, for example), but we want
> > > privileged (CAP_SYS_ADMIN) users to otherwise be able to optimize thp
> > > utilization as they wish.
> >
> > Could you expand some more please? How is this any different from
> > khugepaged (well, except that you can trigger the collapsing explicitly
> > rather than rely on khugepaged to find that mm)?
> >
> 
> MADV_F_COLLAPSE_LIMITS was motivated by being able to replicate &
> extend khugepaged in userspace, where the benefit is precisely that we
> can choose that mm/vma more intelligently.

Could you elaborate some more?

> > > * MADV_F_COLLAPSE_DEFRAG is useful as mentioned above, where we want
> > > to explicitly tell the kernel to try harder to back this by thps,
> > > regardless of the current system/vma configuration.
> > >
> > > Note that when used together, these flags can be used to implement the
> > > exact behavior of khugepaged, through MADV_COLLAPSE.
> >
> > IMHO this is stretching the interface and this can backfire in the
> > future. The interface should be really trivial. I want to collapse a
> > memory area. Let the kernel do the right thing and do not bother with
> > all the implementation details. I would use the same allocation strategy
> > as khugepaged as this seems to be the closest from the latency and
> > application awareness POV. In a way you can look at the madvise call as
> > a way to trigger khugepaged functionality on the particular memory range.
> 
> Trying to summarize a few earlier comments centering around
> MADV_F_COLLAPSE_DEFRAG and allocation semantics.
> 
> This series presupposes the existence of an informed userspace agent
> that is aware of what processes/memory ranges would benefit most from
> thps. Such an agent might either be:
> (1) A system-level daemon optimizing thp utilization system-wide
> (2) A highly tuned process / malloc implementation optimizing their
> own thp usage
> 
> The different types of agents reflect the divide between #PF and
> DEFRAG semantics.
> 
> For (1), we want to view this exactly like triggering khugepaged
> functionality from userspace, and likely want DEFRAG semantics.
> 
> For (2), I was viewing this as the "live" symmetric counterpart to
> at-fault thp allocation where the process has decided, at runtime,
> that this memory could benefit from thp backing, and so #PF semantics
> seemed like a sane default. I'd worry that using DEFRAG semantics by
> default might deter adoption by users who might not be willing to wait
> an unbounded amount of time for direct reclaim.

This time is not really unbounded. THP allocation, even in defrag mode,
doesn't try nearly as hard as e.g. hugetlb allocations.

For your category (2) I am not really sure I see the point. Why would
you want to rely on madvise in a lightweight allocation mode when that
has already been tried at #PF time? If an application really knows it
wants to use THP then madvise(MADV_HUGEPAGE) would be the first thing
to do. This would already tell #PF to try a bit harder in some
configurations, and khugepaged knows that collapsing the memory makes
sense.
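
To illustrate the madvise(MADV_HUGEPAGE) step, here is a minimal userspace
sketch. It is not taken from any patch in this series. The helper name
map_thp_hinted() and the 2MB hugepage size are assumptions made for the
example; the real PMD size depends on the architecture and base page
size.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>

/* Assumption: 2MB PMD-sized THP, as on x86-64 with 4kB base pages. */
#define ASSUMED_HPAGE_SIZE (2UL << 20)

/*
 * Hypothetical helper: map anonymous memory aligned to the assumed
 * hugepage size and mark it with MADV_HUGEPAGE. The length is rounded
 * up to a whole number of hugepages. Returns the mapping or NULL;
 * *hint_errno is 0 if the hint was accepted, otherwise the errno
 * reported by madvise() (the mapping stays usable either way).
 */
static void *map_thp_hinted(size_t len, int *hint_errno)
{
	size_t span, head, tail;
	uintptr_t start;
	char *raw;

	len = (len + ASSUMED_HPAGE_SIZE - 1) & ~(ASSUMED_HPAGE_SIZE - 1);
	if (len == 0)
		return NULL;

	/* Over-map by one hugepage so an aligned window always fits. */
	span = len + ASSUMED_HPAGE_SIZE;
	raw = mmap(NULL, span, PROT_READ | PROT_WRITE,
		   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (raw == MAP_FAILED)
		return NULL;

	start = ((uintptr_t)raw + ASSUMED_HPAGE_SIZE - 1) &
		~(uintptr_t)(ASSUMED_HPAGE_SIZE - 1);
	head = start - (uintptr_t)raw;
	tail = span - head - len;

	/* Trim the unaligned head and the unused tail. */
	if (head)
		munmap(raw, head);
	if (tail)
		munmap((char *)start + len, tail);

	/* Tell #PF and khugepaged that THPs are wanted for this range. */
	*hint_errno = 0;
	if (madvise((void *)start, len, MADV_HUGEPAGE) != 0)
		*hint_errno = errno;

	return (void *)start;
}
```

Whether the subsequent faults actually get a THP still depends on the
system-wide enabled/defrag configuration; the hint only makes the range
eligible.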

That being said, I would be really careful about providing an extended
interface to control how hard to try to allocate a THP. This has a high
risk of externalizing internal implementation details about how
compaction works. Unless we have a strong real-life usecase I would go
with the khugepaged semantics initially. Maybe we will learn about future
usecases where a very lightweight allocation mode is required, but that
can be added later on. The simpler the interface is initially, the
better.
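
From the caller's side, the trivial form of the interface could be as
small as the sketch below. This is only an illustration under
assumptions: the series is still an RFC, so the final calling convention
may differ, the fallback value for MADV_COLLAPSE is not taken from this
posting (it is only used when the installed headers do not define the
constant), and collapse_range() is a hypothetical helper name. On a
kernel without the feature the call is expected to simply fail.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <stddef.h>
#include <sys/mman.h>

#ifndef MADV_COLLAPSE
/* Assumption: placeholder for headers that lack the constant. */
#define MADV_COLLAPSE 25
#endif

/*
 * Hypothetical helper: ask the kernel to collapse [addr, addr + len)
 * into THPs and leave the allocation strategy entirely to the kernel.
 * Returns 0 on success or a negative errno value on failure.
 */
static int collapse_range(void *addr, size_t len)
{
	if (madvise(addr, len, MADV_COLLAPSE) == 0)
		return 0;
	return -errno;
}
```

No flags and no knobs: the caller names a range, and everything else is
the kernel's business.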

Thanks!
-- 
Michal Hocko
SUSE Labs