Re: [RFC PATCH 00/14] mm: userspace hugepage collapse

Michal Hocko <mhocko@xxxxxxxx> · Mon, 21 Mar 2022 15:37:59 +0100

[Removed Richard Henderson from the CC list as delivery fails for
 his address]

On Tue 08-03-22 13:34:03, Zach O'Keefe wrote:
> Introduction
> --------------------------------
> 
> This series provides a mechanism for userspace to induce a collapse of
> eligible ranges of memory into transparent hugepages in process context,
> thus permitting users to more tightly control their own hugepage
> utilization policy at their own expense.
> 
> This idea was previously introduced by David Rientjes, and thanks to
> everyone for your patience while I prepared these patches resulting from
> that discussion[1].
> 
> [1] https://lore.kernel.org/all/C8C89F13-3F04-456B-BA76-DE2C378D30BF@xxxxxxxxxx/
> 
> Interface
> --------------------------------
> 
> The proposed interface adds a new madvise(2) mode, MADV_COLLAPSE, and
> leverages the new process_madvise(2) call.
> 
> (*) process_madvise(2)
> 
>         Performs a synchronous collapse of the native pages mapped by
>         the list of iovecs into transparent hugepages. The default gfp
>         flags used will be the same as those used at-fault for the VMA
>         region(s) covered.

Could you expand on the reasoning here? The default allocation mode for #PF
is rather light; madvised regions will try harder. The reasoning is that we
want to keep stalls due to #PF as small as possible and only try harder for
madvised areas (which is also subject to configuration). Wouldn't it make
more sense to try harder for an explicit call like madvise?
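
To make the light vs. try-harder distinction concrete, here is a rough
userspace model of it. This is not kernel code. It only mirrors the
documented behaviour of the THP "defrag" sysfs setting, and the helper
names thp_selected_mode() and thp_fault_may_stall() are made up for
illustration:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/*
 * Copy the bracketed (active) token of a sysfs-style setting line, e.g.
 * "always defer defer+madvise [madvise] never", into out.
 * Returns false if there is no well-formed [token] that fits in out.
 */
static bool thp_selected_mode(const char *setting, char *out, size_t outlen)
{
	assert(setting != NULL && out != NULL);

	const char *open = strchr(setting, '[');
	const char *close = open ? strchr(open, ']') : NULL;
	if (!close)
		return false;

	size_t n = (size_t)(close - open - 1);
	if (n == 0 || n >= outlen)
		return false;

	memcpy(out, open + 1, n);
	out[n] = '\0';
	return true;
}

/*
 * Model only: may a THP page fault enter direct reclaim/compaction (and
 * therefore stall) for a VMA that is, or is not, marked MADV_HUGEPAGE?
 */
static bool thp_fault_may_stall(const char *mode, bool vma_madvised)
{
	if (strcmp(mode, "always") == 0)
		return true;
	if (strcmp(mode, "madvise") == 0 ||
	    strcmp(mode, "defer+madvise") == 0)
		return vma_madvised;
	return false;	/* "defer", "never", or an unknown token */
}
```

On a live system the setting line would come from
/sys/kernel/mm/transparent_hugepage/defrag. With "madvise" selected, only
faults in MADV_HUGEPAGE regions may stall, which is the asymmetry I am
asking about for an explicit collapse request.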

>         When multiple VMA regions are spanned, if
>         faulting-in memory from any VMA would permit synchronous
>         compaction and reclaim, then all hugepage allocations required
>         to satisfy the request may enter compaction and reclaim.

I am not sure I follow here. Let's have a memory range spanning two
vmas, one with MADV_HUGEPAGE.
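
For reference, such a range is easy to construct from userspace. A minimal
sketch, assuming Linux with glibc; map_mixed_range() is a hypothetical
helper name, and it uses only the existing mmap(2) and madvise(2) calls,
not anything proposed in this series:

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>

/*
 * Map len bytes of anonymous memory and mark only the first half with
 * MADV_HUGEPAGE, so the range is covered by (at least) two VMAs with
 * different THP policy. Returns the mapping, or NULL if mmap() failed.
 * *advised is set to 1 if madvise() succeeded and to 0 if the kernel
 * rejected it (for example EINVAL when THP is not configured).
 */
static void *map_mixed_range(size_t len, int *advised)
{
	assert(advised != NULL);

	void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (p == MAP_FAILED)
		return NULL;

	/* Advising a sub-range splits the VMA at the half-way point. */
	*advised = (madvise(p, len / 2, MADV_HUGEPAGE) == 0);
	return p;
}
```

A single request covering [p, p + len) then spans one madvised and one
non-madvised VMA.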

>         Diverging from the at-fault semantics, VM_NOHUGEPAGE is ignored
>         by default, as the user is explicitly requesting this action.
>         Define two flags to control collapse semantics, passed through
>         process_madvise(2)’s optional flags parameter:

This part is discussed later in the thread.

> 
>         MADV_F_COLLAPSE_LIMITS
> 
>         If supplied, collapse respects pte collapse limits set via
>         sysfs:
>         /transparent_hugepage/khugepaged/max_ptes_[none|swap|shared].
>         Required if calling on behalf of another process and not
>         CAP_SYS_ADMIN.
> 
>         MADV_F_COLLAPSE_DEFRAG
> 
>         If supplied, permit synchronous compaction and reclaim,
>         regardless of VMA flags.

Why do we need this?
-- 
Michal Hocko
SUSE Labs