Introduction
--------------------------------

This series provides a mechanism for userspace to induce a collapse of
eligible ranges of memory into transparent hugepages in process context,
thus permitting users to more tightly control their own hugepage
utilization policy at their own expense.

This idea was previously introduced by David Rientjes, and thanks to
everyone for your patience while I prepared these patches resulting from
that discussion[1].

[1] https://lore.kernel.org/all/C8C89F13-3F04-456B-BA76-DE2C378D30BF@xxxxxxxxxx/

Interface
--------------------------------

The proposed interface adds a new madvise(2) mode, MADV_COLLAPSE, and
leverages the new process_madvise(2) call.

(*) process_madvise(2)

        Performs a synchronous collapse of the native pages mapped by
        the list of iovecs into transparent hugepages. The default gfp
        flags used will be the same as those used at-fault for the VMA
        region(s) covered. When multiple VMA regions are spanned, if
        faulting-in memory from any VMA would permit synchronous
        compaction and reclaim, then all hugepage allocations required
        to satisfy the request may enter compaction and reclaim.
        Diverging from the at-fault semantics, VM_NOHUGEPAGE is ignored
        by default, as the user is explicitly requesting this action.

        Define two flags to control collapse semantics, passed through
        process_madvise(2)'s optional flags parameter:

        MADV_F_COLLAPSE_LIMITS

                If supplied, collapse respects pte collapse limits set
                via sysfs:
                /transparent_hugepage/khugepaged/max_ptes_[none|swap|shared].
                Required if calling on behalf of another process and
                not CAP_SYS_ADMIN.

        MADV_F_COLLAPSE_DEFRAG

                If supplied, permit synchronous compaction and reclaim,
                regardless of VMA flags.

(*) madvise(2)

        Equivalent to process_madvise(2) on self, with no flags passed;
        pte collapse limits are ignored, and the gfp flags will be the
        same as those used at-fault for the VMA region(s) covered.
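
For illustration, a minimal userspace sketch of the madvise(2) form
described above. Assumptions, none of which come from this series: the
fallback value 25 for MADV_COLLAPSE is the one used by mainline Linux
(6.1 and later) and may differ from the uapi value added here;
HPAGE_SIZE assumes 2 MiB PMD-sized hugepages; collapse_range() and
map_aligned() are hypothetical helper names.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <string.h>
#include <sys/mman.h>

/* Assumption: mainline Linux (6.1+) value; this series may use another. */
#ifndef MADV_COLLAPSE
#define MADV_COLLAPSE 25
#endif

/* Assumption: 2 MiB PMD-sized hugepages (e.g. x86-64). */
#define HPAGE_SIZE (2UL << 20)

/*
 * Hypothetical helper: request a synchronous collapse of [addr, addr+len).
 * Returns 0 on success, otherwise the errno reported by madvise(2)
 * (e.g. EINVAL on kernels that do not know MADV_COLLAPSE).
 */
static int collapse_range(void *addr, size_t len)
{
	return madvise(addr, len, MADV_COLLAPSE) ? errno : 0;
}

/*
 * Hypothetical helper: map private anonymous memory whose start is
 * hugepage-aligned. Over-allocates by one hugepage and rounds the start
 * up; the unaligned slack is left mapped for simplicity.
 * Returns NULL if the mapping fails.
 */
static void *map_aligned(size_t len)
{
	char *raw;

	assert(len > 0 && len % HPAGE_SIZE == 0);
	raw = mmap(NULL, len + HPAGE_SIZE, PROT_READ | PROT_WRITE,
		   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (raw == MAP_FAILED)
		return NULL;
	return (void *)(((uintptr_t)raw + HPAGE_SIZE - 1) &
			~(uintptr_t)(HPAGE_SIZE - 1));
}
```

A caller would typically obtain a region from map_aligned(), touch it so
that native pages are present, and then call collapse_range() on it. A
non-zero return means the range is simply left backed by native pages,
so callers should treat failure as non-fatal.
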
        Note that users wanting different collapse semantics can always
        use process_madvise(2) on themselves.

Discussion
--------------------------------

The mechanism is fully compatible with khugepaged, allowing userspace to
separately define synchronous and asynchronous hugepage policies, as
priority dictates. It also naturally permits a DAMON scheme,
DAMOS_COLLAPSE, to make efficient use of the available hugepages on the
system by backing the most frequently accessed memory by hugepages[2].
Though not required to justify this series, hugepage management could be
offloaded entirely to a sufficiently informed userspace agent,
supplanting the need for khugepaged in the kernel.

Along with the interface, this series proposes a batched implementation
to collapse a range of memory. The motivation for this is to limit
contention on mmap_lock by performing multiple page table modifications
each time the lock is held exclusively.

Only private anonymous memory is supported by this series. File-backed
memory support will be added later.

Support for multiple hugepage sizes (such as 1 GiB gigantic hugepages)
was not considered at this time, but could be added via the flags
parameter in the future.

kselftests were omitted from this series for brevity, but would be
included in an eventual patch submission.

[2] https://lore.kernel.org/lkml/bcc8d9a0-81d-5f34-5e4-fcc28eb7ce@xxxxxxxxxx/T/

Sequence of Patches
--------------------------------

Patches 1-10 refactor the collapse logic within khugepaged.c,
introducing the notion of a collapse context and isolating logic that
can be reused later in the series for the madvise collapse context.

Patches 11-14 introduce logic for the proposed madvise collapse
mechanism. Patch 11 adds madvise and header file plumbing. Patches 12
and 13 together add the core collapse logic: the former introduces the
overall batched approach and locking strategy, and the latter fills in
the batch action details. This separation was purely to keep patch size
down.
Patch 14 adds process_madvise support.

Applies against next-20220308.

Zach O'Keefe (14):
  mm/rmap: add mm_find_pmd_raw helper
  mm/khugepaged: add struct collapse_control
  mm/khugepaged: add __do_collapse_huge_page() helper
  mm/khugepaged: separate khugepaged_scan_pmd() scan and collapse
  mm/khugepaged: add mmap_assert_locked() checks to scan_pmd()
  mm/khugepaged: add hugepage_vma_revalidate_pmd_count()
  mm/khugepaged: add vm_flags_ignore to
    hugepage_vma_revalidate_pmd_count()
  mm/thp: add madv_thp_vm_flags to __transparent_hugepage_enabled()
  mm/khugepaged: record SCAN_PAGE_COMPOUND when scan_pmd() finds THP
  mm/khugepaged: rename khugepaged-specific/not functions
  mm/madvise: introduce MADV_COLLAPSE sync hugepage collapse
  mm/madvise: introduce batched madvise(MADV_COLLAPSE) collapse
  mm/madvise: add __madvise_collapse_*_batch() actions.
  mm/madvise: add process_madvise(MADV_COLLAPSE)

 fs/io_uring.c                          |   3 +-
 include/linux/huge_mm.h                |  27 +-
 include/linux/mm.h                     |   3 +-
 include/uapi/asm-generic/mman-common.h |  10 +
 mm/huge_memory.c                       |   2 +-
 mm/internal.h                          |   1 +
 mm/khugepaged.c                        | 937 ++++++++++++++++++++-----
 mm/madvise.c                           |  45 +-
 mm/memory.c                            |   6 +-
 mm/rmap.c                              |  15 +-
 10 files changed, 842 insertions(+), 207 deletions(-)

-- 
2.35.1.616.g0bdcbb4464-goog