On Tue, Mar 8, 2022 at 1:34 PM Zach O'Keefe <zokeefe@xxxxxxxxxx> wrote:
>
> Introduction
> --------------------------------
>
> This series provides a mechanism for userspace to induce a collapse of
> eligible ranges of memory into transparent hugepages in process context,
> thus permitting users to more tightly control their own hugepage
> utilization policy at their own expense.
>
> This idea was previously introduced by David Rientjes, and thanks to
> everyone for your patience while I prepared these patches resulting from
> that discussion[1].
>
> [1] https://lore.kernel.org/all/C8C89F13-3F04-456B-BA76-DE2C378D30BF@xxxxxxxxxx/
>
> Interface
> --------------------------------
>
> The proposed interface adds a new madvise(2) mode, MADV_COLLAPSE, and
> leverages the new process_madvise(2) call.
>
> (*) process_madvise(2)
>
>         Performs a synchronous collapse of the native pages mapped by
>         the list of iovecs into transparent hugepages. The default gfp
>         flags used will be the same as those used at-fault for the VMA
>         region(s) covered. When multiple VMA regions are spanned, if
>         faulting-in memory from any VMA would permit synchronous
>         compaction and reclaim, then all hugepage allocations required
>         to satisfy the request may enter compaction and reclaim.
>         Diverging from the at-fault semantics, VM_NOHUGEPAGE is ignored
>         by default, as the user is explicitly requesting this action.
>         Define two flags to control collapse semantics, passed through
>         process_madvise(2)’s optional flags parameter:
>
>         MADV_F_COLLAPSE_LIMITS
>
>         If supplied, collapse respects pte collapse limits set via
>         sysfs:
>         /transparent_hugepage/khugepaged/max_ptes_[none|swap|shared].
>         Required if calling on behalf of another process and not
>         CAP_SYS_ADMIN.
>
>         MADV_F_COLLAPSE_DEFRAG
>
>         If supplied, permit synchronous compaction and reclaim,
>         regardless of VMA flags.
>
> (*) madvise(2)
>
>         Equivalent to process_madvise(2) on self, with no flags
>         passed; pte collapse limits are ignored, and the gfp flags will
>         be the same as those used at-fault for the VMA region(s)
>         covered. Note that, users wanting different collapse semantics
>         can always use process_madvise(2) on themselves.
>
> Discussion
> --------------------------------
>
> The mechanism is fully compatible with khugepaged, allowing userspace to
> separately define synchronous and asynchronous hugepage policies, as
> priority dictates. It also naturally permits a DAMON scheme,
> DAMOS_COLLAPSE, to make efficient use of the available hugepages on the
> system by backing the most frequently accessed memory by hugepages[2].
> Though not required to justify this series, hugepage management could be
> offloaded entirely to a sufficiently informed userspace agent,
> supplanting the need for khugepaged in the kernel.
>
> Along with the interface, this series proposes a batched implementation
> to collapse a range of memory. The motivation for this is to limit
> contention on mmap_lock, doing multiple page table modifications while
> the lock is held exclusively.
>
> Only private anonymous memory is supported by this series. File-backed
> memory support will be added later.
>
> Multiple hugepages support (such as 1 GiB gigantic hugepages) were not
> considered at this time, but could be supported by the flags parameter
> in the future.
>
> kselftests were omitted from this series for brevity, but would be
> included in an eventual patch submission.
>
> [2] https://lore.kernel.org/lkml/bcc8d9a0-81d-5f34-5e4-fcc28eb7ce@xxxxxxxxxx/T/
>
> Sequence of Patches
> --------------------------------
>
> Patches 1-10 perform refactoring of collapse logic within khugepaged.c:
> introducing the notion of a collapse context and isolating logic that
> can be reused later in the series for the madvise collapse context.
>
> Patches 11-14 introduce logic for the proposed madvise collapse
> mechanism. Patch 11 adds madvise and header file plumbing. Patch 12 and
> 13, separately, add the core collapse logic, with the former introducing
> the overall batched approach and locking strategy, and the latter
> fills-in batch action details. This separation was purely to keep patch
> size down. Patch 14 adds process_madvise support.
>
> Applies against next-20220308.
>
> Zach O'Keefe (14):
>   mm/rmap: add mm_find_pmd_raw helper
>   mm/khugepaged: add struct collapse_control
>   mm/khugepaged: add __do_collapse_huge_page() helper
>   mm/khugepaged: separate khugepaged_scan_pmd() scan and collapse
>   mm/khugepaged: add mmap_assert_locked() checks to scan_pmd()
>   mm/khugepaged: add hugepage_vma_revalidate_pmd_count()
>   mm/khugepaged: add vm_flags_ignore to
>     hugepage_vma_revalidate_pmd_count()
>   mm/thp: add madv_thp_vm_flags to __transparent_hugepage_enabled()
>   mm/khugepaged: record SCAN_PAGE_COMPOUND when scan_pmd() finds THP
>   mm/khugepaged: rename khugepaged-specific/not functions
>   mm/madvise: introduce MADV_COLLAPSE sync hugepage collapse
>   mm/madvise: introduce batched madvise(MADV_COLLPASE) collapse
>   mm/madvise: add __madvise_collapse_*_batch() actions.
>   mm/madvise: add process_madvise(MADV_COLLAPSE)
>
>  fs/io_uring.c                          |   3 +-
>  include/linux/huge_mm.h                |  27 +-
>  include/linux/mm.h                     |   3 +-
>  include/uapi/asm-generic/mman-common.h |  10 +
>  mm/huge_memory.c                       |   2 +-
>  mm/internal.h                          |   1 +
>  mm/khugepaged.c                        | 937 ++++++++++++++++++++-----
>  mm/madvise.c                           |  45 +-
>  mm/memory.c                            |   6 +-
>  mm/rmap.c                              |  15 +-
>  10 files changed, 842 insertions(+), 207 deletions(-)
>
> --
> 2.35.1.616.g0bdcbb4464-goog
>

Thanks to the many people who took the time to review and provide feedback on this series.
In preparation for a V1 PATCH series that will incorporate the feedback received here, there is one item I'd specifically like the community's input on: is support for privately-mapped anonymous memory sufficient to motivate an initial landing of MADV_COLLAPSE, with file-backed support coming later?

I have local patches to support file-backed memory, but my thought was to keep the series no longer than necessary, out of consideration for reviewers. No substantial infrastructure changes are required to support file-backed memory; it builds naturally on top of the existing series, as the series was developed with file-backed support already fleshed out.

Thanks,
Zach
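
For illustration, here is a rough sketch of how a caller might exercise the madvise(2) form described in the quoted cover letter: map a hugepage-aligned private anonymous region, fault it in, and request a synchronous collapse. This is not code from the series. The helper names map_aligned() and collapse_range() are hypothetical, the 2 MiB hugepage size assumes PMD-sized THPs as on x86-64, and the fallback value 25 for MADV_COLLAPSE is the one later mainline uapi headers define, which may not match this RFC. On a kernel without MADV_COLLAPSE the call is expected to fail (typically with EINVAL), and the sketch reports that as a negative errno rather than treating it as fatal.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <sys/mman.h>

#ifndef MADV_COLLAPSE
#define MADV_COLLAPSE 25 /* assumption: value from later mainline headers */
#endif

/* Assumption: 2 MiB PMD-sized transparent hugepages (e.g. x86-64). */
#define HPAGE_SIZE ((size_t)2 << 20)

/*
 * Hypothetical helper: map `len` bytes of private anonymous memory whose
 * start address is HPAGE_SIZE-aligned. `len` must be a multiple of
 * HPAGE_SIZE. Returns NULL on failure.
 */
static char *map_aligned(size_t len)
{
    assert(len > 0 && len % HPAGE_SIZE == 0);

    /* Over-allocate by one hugepage so an aligned window always fits. */
    size_t padded = len + HPAGE_SIZE;
    char *raw = mmap(NULL, padded, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (raw == MAP_FAILED)
        return NULL;

    uintptr_t up = ((uintptr_t)raw + HPAGE_SIZE - 1) & ~(uintptr_t)(HPAGE_SIZE - 1);
    char *aligned = (char *)up;

    /* Trim the unaligned head and the unused tail of the mapping. */
    if (aligned > raw)
        munmap(raw, (size_t)(aligned - raw));
    char *end = aligned + len;
    char *raw_end = raw + padded;
    if (raw_end > end)
        munmap(end, (size_t)(raw_end - end));

    return aligned;
}

/*
 * Hypothetical helper: ask the kernel to synchronously collapse
 * [addr, addr + len) into hugepages. Returns 0 on success or a negative
 * errno value (e.g. -EINVAL on kernels without MADV_COLLAPSE).
 */
static int collapse_range(void *addr, size_t len)
{
    if (madvise(addr, len, MADV_COLLAPSE) == 0)
        return 0;
    return -errno;
}
```

A caller would typically touch the region first (for example with memset()) so that native pages are actually mapped, then call collapse_range(buf, len) and fall back to relying on khugepaged if it returns an error. The process_madvise(2) flags proposed in the cover letter are deliberately not shown, since their values are not given in this thread.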