[RFC PATCH 00/14] mm: userspace hugepage collapse

"Zach O'Keefe" <zokeefe@xxxxxxxxxx> · Tue, 8 Mar 2022 13:34:03 -0800

Introduction
--------------------------------

This series provides a mechanism for userspace to induce a collapse of
eligible ranges of memory into transparent hugepages in process context,
thus permitting users to more tightly control their own hugepage
utilization policy at their own expense.

This idea was previously introduced by David Rientjes. Thanks to
everyone for their patience while I prepared these patches, which
result from that discussion[1].

[1] https://lore.kernel.org/all/C8C89F13-3F04-456B-BA76-DE2C378D30BF@xxxxxxxxxx/

Interface
--------------------------------

The proposed interface adds a new madvise(2) mode, MADV_COLLAPSE, and
leverages the new process_madvise(2) call.

(*) process_madvise(2)

        Performs a synchronous collapse of the native pages mapped by
        the list of iovecs into transparent hugepages. The default gfp
        flags used will be the same as those used at-fault for the VMA
        region(s) covered. When multiple VMA regions are spanned, if
        faulting-in memory from any VMA would permit synchronous
        compaction and reclaim, then all hugepage allocations required
        to satisfy the request may enter compaction and reclaim.
        Diverging from the at-fault semantics, VM_NOHUGEPAGE is ignored
        by default, as the user is explicitly requesting this action.
        Two flags, passed through process_madvise(2)'s optional flags
        parameter, control collapse semantics:

        MADV_F_COLLAPSE_LIMITS

        If supplied, collapse respects pte collapse limits set via
        sysfs:
        /transparent_hugepage/khugepaged/max_ptes_[none|swap|shared].
        Required when calling on behalf of another process without
        CAP_SYS_ADMIN.

        MADV_F_COLLAPSE_DEFRAG

        If supplied, permit synchronous compaction and reclaim,
        regardless of VMA flags.

(*) madvise(2)

        Equivalent to process_madvise(2) on self, with no flags
        passed; pte collapse limits are ignored, and the gfp flags will
        be the same as those used at-fault for the VMA region(s)
        covered. Note that users wanting different collapse semantics
        can always use process_madvise(2) on themselves.
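
To make the two entry points concrete, here is a minimal userspace sketch
of how a caller might request a collapse on itself. It is an illustration
under stated assumptions, not code from this series: the MADV_COLLAPSE
fallback value (25) is the one found in later mainline headers and may
differ from what this RFC defines, the 2 MiB hugepage size assumes x86-64
PMD-sized THPs, the helper names are hypothetical, and the numeric values
of MADV_F_COLLAPSE_LIMITS and MADV_F_COLLAPSE_DEFRAG are not given in this
cover letter, so the caller passes flags in rather than the sketch
defining them.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <sys/uio.h>
#include <unistd.h>

/* Assumption: value from later mainline headers; this RFC may differ. */
#ifndef MADV_COLLAPSE
#define MADV_COLLAPSE 25
#endif
/* Fallbacks for older libc headers (numbers shared by most architectures). */
#ifndef SYS_pidfd_open
#define SYS_pidfd_open 434
#endif
#ifndef SYS_process_madvise
#define SYS_process_madvise 440
#endif

/* Assumption: 2 MiB PMD-sized hugepages, as on x86-64. */
#define HPAGE_SIZE_ASSUMED ((size_t)2 << 20)

/* Round an address up to the next assumed hugepage boundary. */
static uintptr_t hpage_align_up(uintptr_t addr)
{
    return (addr + HPAGE_SIZE_ASSUMED - 1) & ~(uintptr_t)(HPAGE_SIZE_ASSUMED - 1);
}

/* madvise(2) path: no flags. Returns 0 on success, -errno on failure. */
static int collapse_self(void *addr, size_t len)
{
    return madvise(addr, len, MADV_COLLAPSE) == 0 ? 0 : -errno;
}

/*
 * process_madvise(2) path on the calling process. 'flags' is supplied by
 * the caller (for example the proposed MADV_F_COLLAPSE_* bits).
 * Returns the number of bytes advised, or -errno on failure.
 */
static long collapse_self_flags(void *addr, size_t len, unsigned int flags)
{
    struct iovec iov = { .iov_base = addr, .iov_len = len };
    int pidfd = (int)syscall(SYS_pidfd_open, getpid(), 0);
    long ret;

    if (pidfd < 0)
        return -errno;
    ret = syscall(SYS_process_madvise, pidfd, &iov, 1UL, MADV_COLLAPSE, flags);
    if (ret < 0)
        ret = -errno;          /* capture errno before close() can change it */
    close(pidfd);
    return ret;
}
```

A caller would typically mmap() an anonymous region, fault it in, and then
call collapse_self() on a range starting at hpage_align_up() of the mapping
address. On kernels without MADV_COLLAPSE both helpers are expected to
return a negative errno such as -EINVAL.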

Discussion
--------------------------------

The mechanism is fully compatible with khugepaged, allowing userspace to
separately define synchronous and asynchronous hugepage policies, as
priority dictates. It also naturally permits a DAMON scheme,
DAMOS_COLLAPSE, to make efficient use of the available hugepages on the
system by backing the most frequently accessed memory with hugepages[2].
Though not required to justify this series, hugepage management could be
offloaded entirely to a sufficiently informed userspace agent,
supplanting the need for khugepaged in the kernel.

Along with the interface, this series proposes a batched implementation
to collapse a range of memory. The motivation is to limit contention on
mmap_lock by performing multiple page table modifications each time the
lock is held exclusively.
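
The effect of batching on lock traffic can be illustrated with a small
userspace model. This is a sketch only, not the kernel implementation: the
lock and collapse functions are hypothetical stand-ins that merely count
calls, and the batch size is an arbitrary parameter. It shows that
processing regions in batches takes the exclusive lock once per batch
rather than once per hugepage-sized region.

```c
#include <stddef.h>

/* Hypothetical stand-ins for taking mmap_lock exclusively; they only count. */
static size_t exclusive_acquisitions;
static void fake_mmap_write_lock(void)   { exclusive_acquisitions++; }
static void fake_mmap_write_unlock(void) { }

/* Hypothetical stand-in for one hugepage-sized page table update. */
static size_t regions_updated;
static void fake_collapse_one(size_t region) { (void)region; regions_updated++; }

/*
 * Process nr_regions regions, batch_size at a time, holding the (fake)
 * exclusive lock once per batch. Returns the number of lock acquisitions.
 */
static size_t collapse_batched(size_t nr_regions, size_t batch_size)
{
    size_t before = exclusive_acquisitions;
    size_t start = 0;

    if (batch_size == 0)
        batch_size = 1;        /* avoid an infinite loop on a zero batch */

    while (start < nr_regions) {
        size_t left = nr_regions - start;
        size_t count = left < batch_size ? left : batch_size;

        fake_mmap_write_lock();
        for (size_t i = 0; i < count; i++)
            fake_collapse_one(start + i);
        fake_mmap_write_unlock();

        start += count;
    }
    return exclusive_acquisitions - before;
}
```

With 10 regions and a batch size of 4, the model takes the lock 3 times
(batches of 4, 4 and 2) instead of 10.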

Only private anonymous memory is supported by this series. File-backed
memory support will be added later.

Support for multiple hugepage sizes (such as 1 GiB gigantic hugepages)
was not considered at this time, but could be added through the flags
parameter in the future.

kselftests were omitted from this series for brevity, but would be
included in an eventual patch submission.

[2] https://lore.kernel.org/lkml/bcc8d9a0-81d-5f34-5e4-fcc28eb7ce@xxxxxxxxxx/T/

Sequence of Patches
--------------------------------

Patches 1-10 perform refactoring of collapse logic within khugepaged.c:
introducing the notion of a collapse context and isolating logic that
can be reused later in the series for the madvise collapse context.

Patches 11-14 introduce logic for the proposed madvise collapse
mechanism. Patch 11 adds madvise and header file plumbing. Patches 12
and 13 together add the core collapse logic: the former introduces the
overall batched approach and locking strategy, and the latter fills in
the batch action details. This separation is purely to keep patch size
down. Patch 14 adds process_madvise support.

Applies against next-20220308.

Zach O'Keefe (14):
  mm/rmap: add mm_find_pmd_raw helper
  mm/khugepaged: add struct collapse_control
  mm/khugepaged: add __do_collapse_huge_page() helper
  mm/khugepaged: separate khugepaged_scan_pmd() scan and collapse
  mm/khugepaged: add mmap_assert_locked() checks to scan_pmd()
  mm/khugepaged: add hugepage_vma_revalidate_pmd_count()
  mm/khugepaged: add vm_flags_ignore to
    hugepage_vma_revalidate_pmd_count()
  mm/thp: add madv_thp_vm_flags to __transparent_hugepage_enabled()
  mm/khugepaged: record SCAN_PAGE_COMPOUND when scan_pmd() finds THP
  mm/khugepaged: rename khugepaged-specific/not functions
  mm/madvise: introduce MADV_COLLAPSE sync hugepage collapse
  mm/madvise: introduce batched madvise(MADV_COLLAPSE) collapse
  mm/madvise: add __madvise_collapse_*_batch() actions.
  mm/madvise: add process_madvise(MADV_COLLAPSE)

 fs/io_uring.c                          |   3 +-
 include/linux/huge_mm.h                |  27 +-
 include/linux/mm.h                     |   3 +-
 include/uapi/asm-generic/mman-common.h |  10 +
 mm/huge_memory.c                       |   2 +-
 mm/internal.h                          |   1 +
 mm/khugepaged.c                        | 937 ++++++++++++++++++++-----
 mm/madvise.c                           |  45 +-
 mm/memory.c                            |   6 +-
 mm/rmap.c                              |  15 +-
 10 files changed, 842 insertions(+), 207 deletions(-)

-- 
2.35.1.616.g0bdcbb4464-goog