Introduction
--------------------------------

This series provides a mechanism for userspace to induce a collapse of
eligible ranges of memory into transparent hugepages in process context,
thus permitting users to more tightly control their own hugepage
utilization policy at their own expense.

This idea was introduced by David Rientjes[1], and the semantics and
implementation were introduced and discussed in a previous PATCH RFC[2].

Interface
--------------------------------

The proposed interface adds a new madvise(2) mode, MADV_COLLAPSE, and
leverages the new process_madvise(2) call.

(*) process_madvise(2)

        Performs a synchronous collapse of the native pages mapped by
        the list of iovecs into transparent hugepages.

        Allocation semantics are the same as khugepaged, and depend on
        (1) the active sysfs settings
        /sys/kernel/mm/transparent_hugepage/enabled and
        /sys/kernel/mm/transparent_hugepage/khugepaged/defrag, and
        (2) the VMA flags of the memory range being collapsed.

        Collapse eligibility criteria differ from khugepaged in that
        the sysfs files
        /sys/kernel/mm/transparent_hugepage/khugepaged/max_ptes_[none|swap|shared]
        are ignored.

        When a range spans multiple hugepage-aligned/sized regions, the
        semantics of the collapse of each region are independent of the
        others.

        The caller must have CAP_SYS_ADMIN if not acting on self.

        Return value follows existing process_madvise(2) conventions.
        A “success” indicates that all hugepage-sized/aligned regions
        covered by the provided range were either successfully
        collapsed, or were already pmd-mapped THPs.

(*) madvise(2)

        Equivalent to process_madvise(2) on self, with 0 returned on
        “success”.

Future work
--------------------------------

Only private anonymous memory is supported by this series. File and
shmem memory support will be added later.

One possible user of this functionality is a userspace agent that
attempts to optimize THP utilization system-wide by allocating THPs
based on, for example, task priority, task performance requirements, or
heatmaps. For the latter, one idea that has already surfaced is using
DAMON to identify hot regions, and driving THP collapse through a new
DAMOS_COLLAPSE scheme[3].

Sequence of Patches
--------------------------------

Patches 1-4 perform refactoring of collapse logic within khugepaged.c
and introduce the notion of a collapse context.

Patches 5-9 introduce MADV_COLLAPSE, do some renaming, add support so
that the MADV_COLLAPSE context has the eligibility and allocation
semantics referenced above, and add process_madvise(2) support.

Patches 10-12 add selftests to test the new functionality.

Applies against next-20220408.

[1] https://lore.kernel.org/all/C8C89F13-3F04-456B-BA76-DE2C378D30BF@xxxxxxxxxx/
[2] https://lore.kernel.org/linux-mm/20220308213417.1407042-1-zokeefe@xxxxxxxxxx/
[3] https://lore.kernel.org/lkml/bcc8d9a0-81d-5f34-5e4-fcc28eb7ce@xxxxxxxxxx/T/

Zach O'Keefe (13):
  mm/khugepaged: separate hugepage preallocation and cleanup
  mm/khugepaged: record SCAN_PMD_MAPPED when scan_pmd() finds THP
  mm/khugepaged: add struct collapse_control
  mm/khugepaged: make hugepage allocation context-specific
  mm/khugepaged: add struct collapse_result
  mm/madvise: introduce MADV_COLLAPSE sync hugepage collapse
  mm/khugepaged: remove khugepaged prefix from shared collapse functions
  mm/khugepaged: add flag to ignore khugepaged_max_ptes_*
  mm/khugepaged: add flag to ignore page young/referenced requirement
  mm/madvise: add MADV_COLLAPSE to process_madvise()
  selftests/vm: modularize collapse selftests
  selftests/vm: add MADV_COLLAPSE collapse context to selftests
  selftests/vm: add test to verify recollapse of THPs

 include/linux/huge_mm.h                 |  12 +
 include/trace/events/huge_memory.h      |   5 +-
 include/uapi/asm-generic/mman-common.h  |   2 +
 mm/internal.h                           |   1 +
 mm/khugepaged.c                         | 598 ++++++++++++++++--------
 mm/madvise.c                            |  11 +-
 mm/rmap.c                               |  15 +-
 tools/testing/selftests/vm/khugepaged.c | 417 +++++++++++------
 8 files changed, 702 insertions(+), 359 deletions(-)

--
2.35.1.1178.g4f1659d476-goog
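
Usage sketch
--------------------------------

To make the madvise(2) mode described under "Interface" concrete, the
following is a minimal userspace sketch, not code taken from this
series. It rests on several assumptions: the helper names align_up(),
map_aligned_anon() and collapse_range() are hypothetical; HPAGE_SIZE
assumes 2 MiB PMD-sized hugepages (x86-64 with 4 KiB base pages); and
the fallback value 25 for MADV_COLLAPSE is the one used by later
mainline kernels, which may not match the value defined by this series
in include/uapi/asm-generic/mman-common.h.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <sys/mman.h>

/*
 * Assumption: the libc headers in use may not define MADV_COLLAPSE.
 * 25 is the value used by later mainline kernels; check the uapi header
 * of the kernel you actually build against.
 */
#ifndef MADV_COLLAPSE
#define MADV_COLLAPSE 25
#endif

/* Assumption: 2 MiB PMD-sized hugepages. */
#define HPAGE_SIZE ((size_t)2 << 20)

/* Round addr up to the next multiple of align (a power of two). */
static uintptr_t align_up(uintptr_t addr, size_t align)
{
	assert(align != 0 && (align & (align - 1)) == 0);
	return (addr + align - 1) & ~((uintptr_t)align - 1);
}

/*
 * Map nr_hpages hugepage-aligned regions of private anonymous memory
 * (the only memory type this series supports) and fault them in.
 * Returns the aligned start address, or NULL on failure. The
 * over-allocated slack is left mapped for brevity.
 */
static void *map_aligned_anon(size_t nr_hpages)
{
	size_t len = (nr_hpages + 1) * HPAGE_SIZE;
	void *raw = mmap(NULL, len, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	void *start;

	if (raw == MAP_FAILED)
		return NULL;

	start = (void *)align_up((uintptr_t)raw, HPAGE_SIZE);
	memset(start, 1, nr_hpages * HPAGE_SIZE);
	return start;
}

/*
 * Request a synchronous collapse of [start, start + len).
 * Returns 0 on "success", otherwise a negative errno value.
 */
static int collapse_range(void *start, size_t len)
{
	if (madvise(start, len, MADV_COLLAPSE) == 0)
		return 0;
	return -errno;
}
```

A caller would obtain a range with map_aligned_anon(4) and pass it to
collapse_range() with a length of 4 * HPAGE_SIZE. A return of 0 would
then mean every covered hugepage-sized region was collapsed or was
already a pmd-mapped THP. A kernel without this series is expected to
reject the unknown advice value with EINVAL, so callers should treat
that as "not supported" rather than as a collapse failure.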