Re: [RFC PATCH 00/14] mm: userspace hugepage collapse

On Tue, Mar 8, 2022 at 1:34 PM Zach O'Keefe <zokeefe@xxxxxxxxxx> wrote:
>
> Introduction
> --------------------------------
>
> This series provides a mechanism for userspace to induce a collapse of
> eligible ranges of memory into transparent hugepages in process context,
> thus permitting users to more tightly control their own hugepage
> utilization policy at their own expense.
>
> This idea was previously introduced by David Rientjes, and thanks to
> everyone for your patience while I prepared these patches resulting from
> that discussion[1].
>
> [1] https://lore.kernel.org/all/C8C89F13-3F04-456B-BA76-DE2C378D30BF@xxxxxxxxxx/
>
> Interface
> --------------------------------
>
> The proposed interface adds a new madvise(2) mode, MADV_COLLAPSE, and
> leverages the new process_madvise(2) call.
>
> (*) process_madvise(2)
>
>         Performs a synchronous collapse of the native pages mapped by
>         the list of iovecs into transparent hugepages. The default gfp
>         flags used will be the same as those used at-fault for the VMA
>         region(s) covered. When multiple VMA regions are spanned, if
>         faulting-in memory from any VMA would permit synchronous
>         compaction and reclaim, then all hugepage allocations required
>         to satisfy the request may enter compaction and reclaim.
>         Diverging from the at-fault semantics, VM_NOHUGEPAGE is ignored
>         by default, as the user is explicitly requesting this action.
>         Define two flags to control collapse semantics, passed through
>         process_madvise(2)’s optional flags parameter:
>
>         MADV_F_COLLAPSE_LIMITS
>
>         If supplied, collapse respects pte collapse limits set via
>         sysfs:
>         /transparent_hugepage/khugepaged/max_ptes_[none|swap|shared].
>         Required if calling on behalf of another process and not
>         CAP_SYS_ADMIN.
>
>         MADV_F_COLLAPSE_DEFRAG
>
>         If supplied, permit synchronous compaction and reclaim,
>         regardless of VMA flags.
>
> (*) madvise(2)
>
>         Equivalent to process_madvise(2) on self, with no flags
>         passed; pte collapse limits are ignored, and the gfp flags will
>         be the same as those used at-fault for the VMA region(s)
>         covered. Note that users wanting different collapse semantics
>         can always use process_madvise(2) on themselves.
>
> Discussion
> --------------------------------
>
> The mechanism is fully compatible with khugepaged, allowing userspace to
> separately define synchronous and asynchronous hugepage policies, as
> priority dictates. It also naturally permits a DAMON scheme,
> DAMOS_COLLAPSE, to make efficient use of the available hugepages on the
> system by backing the most frequently accessed memory by hugepages[2].
> Though not required to justify this series, hugepage management could be
> offloaded entirely to a sufficiently informed userspace agent,
> supplanting the need for khugepaged in the kernel.
>
> Along with the interface, this series proposes a batched implementation
> to collapse a range of memory. The motivation for this is to limit
> contention on mmap_lock, doing multiple page table modifications while
> the lock is held exclusively.
>
> Only private anonymous memory is supported by this series. File-backed
> memory support will be added later.
>
> Support for multiple hugepage sizes (such as 1 GiB gigantic hugepages)
> was not considered at this time, but could be added via the flags
> parameter in the future.
>
> kselftests were omitted from this series for brevity, but would be
> included in an eventual patch submission.
>
> [2] https://lore.kernel.org/lkml/bcc8d9a0-81d-5f34-5e4-fcc28eb7ce@xxxxxxxxxx/T/
>
> Sequence of Patches
> --------------------------------
>
> Patches 1-10 perform refactoring of collapse logic within khugepaged.c:
> introducing the notion of a collapse context and isolating logic that
> can be reused later in the series for the madvise collapse context.
>
> Patches 11-14 introduce logic for the proposed madvise collapse
> mechanism. Patch 11 adds madvise and header file plumbing. Patches 12
> and 13 separately add the core collapse logic, with the former
> introducing the overall batched approach and locking strategy, and the
> latter filling in batch action details. This separation was purely to
> keep patch size down. Patch 14 adds process_madvise support.
>
> Applies against next-20220308.
>
> Zach O'Keefe (14):
>   mm/rmap: add mm_find_pmd_raw helper
>   mm/khugepaged: add struct collapse_control
>   mm/khugepaged: add __do_collapse_huge_page() helper
>   mm/khugepaged: separate khugepaged_scan_pmd() scan and collapse
>   mm/khugepaged: add mmap_assert_locked() checks to scan_pmd()
>   mm/khugepaged: add hugepage_vma_revalidate_pmd_count()
>   mm/khugepaged: add vm_flags_ignore to
>     hugepage_vma_revalidate_pmd_count()
>   mm/thp: add madv_thp_vm_flags to __transparent_hugepage_enabled()
>   mm/khugepaged: record SCAN_PAGE_COMPOUND when scan_pmd() finds THP
>   mm/khugepaged: rename khugepaged-specific/not functions
>   mm/madvise: introduce MADV_COLLAPSE sync hugepage collapse
>   mm/madvise: introduce batched madvise(MADV_COLLPASE) collapse
>   mm/madvise: add __madvise_collapse_*_batch() actions.
>   mm/madvise: add process_madvise(MADV_COLLAPSE)
>
>  fs/io_uring.c                          |   3 +-
>  include/linux/huge_mm.h                |  27 +-
>  include/linux/mm.h                     |   3 +-
>  include/uapi/asm-generic/mman-common.h |  10 +
>  mm/huge_memory.c                       |   2 +-
>  mm/internal.h                          |   1 +
>  mm/khugepaged.c                        | 937 ++++++++++++++++++++-----
>  mm/madvise.c                           |  45 +-
>  mm/memory.c                            |   6 +-
>  mm/rmap.c                              |  15 +-
>  10 files changed, 842 insertions(+), 207 deletions(-)
>
> --
> 2.35.1.616.g0bdcbb4464-goog
>

Thanks to the many people who took the time to review and provide
feedback on this series.

In preparation for a V1 PATCH series which will incorporate the
feedback received here, one item I'd specifically like feedback from
the community on is whether support for privately-mapped anonymous
memory is sufficient to motivate an initial landing of MADV_COLLAPSE,
with file-backed support coming later. I have local patches to support
file-backed memory, but my thought was to keep the series no longer
than necessary, out of consideration for reviewers. No substantial
infrastructure changes are required to support file-backed memory; it
naturally builds on top of the existing series (as it was developed
with file-backed support fleshed out).
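
For anyone following along, the private anonymous case described above
might look roughly like the sketch below from userspace. This is only an
illustration under stated assumptions, not code from the series: the
fallback value 25 for MADV_COLLAPSE is the number used by mainline
kernels that carry the advice and may not match this RFC, the 2 MiB
hugepage size is the common x86-64 PMD size, and the helper name
collapse_one_hugepage() is made up for the example.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <sys/mman.h>

/* Assumption: 25 is the value mainline kernels assign to MADV_COLLAPSE;
 * the RFC discussed in this thread may have used a different number. */
#ifndef MADV_COLLAPSE
#define MADV_COLLAPSE 25
#endif

/* Assumption: PMD-sized hugepages are 2 MiB (typical on x86-64). */
#define HPAGE_SIZE ((size_t)2 << 20)

/*
 * Hypothetical helper: map private anonymous memory, touch one
 * hugepage-aligned, hugepage-sized range so it is populated, then
 * request a synchronous collapse of that range.
 *
 * Returns 0 on success or a negative errno value (for example -EINVAL
 * on kernels that do not know MADV_COLLAPSE).
 */
static int collapse_one_hugepage(void)
{
    /* Over-allocate so an aligned hugepage-sized range always fits. */
    size_t len = 2 * HPAGE_SIZE;
    char *map = mmap(NULL, len, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (map == MAP_FAILED)
        return -errno;

    /* Round the start address up to the next hugepage boundary. */
    uintptr_t start = ((uintptr_t)map + HPAGE_SIZE - 1) &
                      ~(uintptr_t)(HPAGE_SIZE - 1);
    char *aligned = (char *)start;

    /* Fault the range in so there is something to collapse. */
    memset(aligned, 1, HPAGE_SIZE);

    int ret = 0;
    if (madvise(aligned, HPAGE_SIZE, MADV_COLLAPSE) != 0)
        ret = -errno;

    munmap(map, len);
    return ret;
}
```

A caller would treat 0 as "the range is now backed by a hugepage" and
any negative value as a failure to collapse; since the collapse happens
in the caller's context, the error is reported directly to the caller
rather than being retried in the background.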

Thanks,
Zach




