Hey All,

There are still a couple of interface topics (capabilities for process_madvise(2), errnos) to iron out, but for the most part the behavior and semantics of MADV_COLLAPSE on anonymous memory seem to be settled. Thanks to everyone for the time and effort put into that.

Looking forward, I'd like to align on the semantics of file/shmem so that we can finalize MADV_COLLAPSE behavior. This is what I'd propose for an initial man-page-like description of MADV_COLLAPSE for madvise(2), to paint a full picture:

---8<---
Perform a best-effort synchronous collapse of the native pages mapped by the memory range into Transparent Hugepages (THPs). MADV_COLLAPSE operates on the current state of memory for the specified process and makes no persistent changes or guarantees on how pages will be mapped, constructed, or faulted in the future. However, for file/shmem memory, other mappings of this file extent may be queued and processed later by khugepaged to attempt to update their pagetables to map the hugepage by a PMD.

If the range provided spans multiple VMAs, the semantics of the collapse over each VMA are independent of the others. This implies a hugepage cannot cross a VMA boundary. If collapse of a given hugepage-aligned/sized region fails, the operation may continue to attempt collapsing the remainder of the specified memory.

All non-resident pages covered by the range will first be swapped/faulted-in, before being copied onto a freshly allocated hugepage. If the native pages compose the same PTE-mapped hugepage, and are suitably aligned, the collapse may happen in-place. Unmapped pages will have their data directly initialized to 0 in the new hugepage. However, for every eligible hugepage-aligned/sized region to be collapsed, at least one page must currently be backed by memory.

MADV_COLLAPSE is independent of any THP sysfs setting, both in terms of determining THP eligibility, and allocation semantics.
The VMA must not be marked VM_NOHUGEPAGE, VM_HUGETLB**, VM_IO, VM_DONTEXPAND, VM_MIXEDMAP, or VM_PFNMAP, nor can it be stack memory or DAX-backed. The process must not have PR_SET_THP_DISABLE set. For file-backed memory, either (1) the file must not be open for write and the mapping must be executable, or (2) the backing filesystem must support large pages.

Allocation for the new hugepage may enter direct reclaim and/or compaction, regardless of VMA flags. When the system has multiple NUMA nodes, the hugepage will be allocated from the node providing the most native pages.

If all hugepage-sized/aligned regions covered by the provided range were either successfully collapsed, or were already PMD-mapped THPs, this operation will be deemed successful. On successful return, all hugepage-aligned/sized memory regions provided will be mapped by PMDs. Note that this doesn't guarantee anything about other possible mappings of the memory. Conversely, a failure return may reflect failures in several regions, since the operation may continue after collapse of a single hugepage-sized/aligned region fails.

MADV_COLLAPSE is only available if the kernel was configured with CONFIG_TRANSPARENT_HUGEPAGE, and file/shmem support additionally requires CONFIG_READ_ONLY_THP_FOR_FS and CONFIG_SHMEM.
---8<---

** Might change with HugeTLB high-granularity mappings[1].

There are a few new items of note here:

1) PMD-mapped on success

MADV_COLLAPSE ultimately wants memory mapped by PMDs, and so I propose we should always try to actually do the page table updates. For file/shmem, this means two things: (a) adding support to handle compound pages (both pte-mapped hugepages and non-HPAGE_PMD_ORDER compound pages), and (b) doing a final PMD install before returning, and not relying on a subsequent fault. This makes the semantics of file/shmem the same as anonymous. I call out (a), since there was an existing debate about this, and so I want to ensure we are aligned[1].
Note that (b), along with presenting a consistent interface to users, also has real-world use cases where relying on fault is difficult (for example, shmem + UFFDIO_REGISTER_MODE_MINOR-managed memory). Also note that for (b), I'm proposing to only do the synchronous PMD install for the memory range provided - the page table collapse of other mappings of the memory can be deferred until later (by khugepaged).

2) folio timing && file non-writable, executable mapping

I just want to align on some timing due to ongoing folio work. Currently, the requirement to be able to collapse file/shmem memory is that the file not be opened for write anywhere, and that the mapping is executable, but we'd eventually like to support filesystems that claim mapping_large_folio_support()[2]. Is it acceptable that future MADV_COLLAPSE works for either mapping_large_folio_support() or the old conditions? Alternatively, should MADV_COLLAPSE only support mapping_large_folio_support() filesystems from the outset? (I believe shmem and xfs are the only current users)

3) (shmem) sysfs settings and huge= tmpfs mount

Should we ignore /sys/kernel/mm/transparent_hugepage/shmem_enabled, similar to how we ignore /sys/kernel/mm/transparent_hugepage/enabled for anon/file? Does that include "deny"? This choice is (partially) coupled with the tmpfs huge= mount option. I think today, things work if we ignore this. However, I don't want to back us into a corner if we ever want to allow MADV_COLLAPSE to work on writeable shmem mappings one day (or any other incompatibility I'm unaware of). One option: if in (2) we chose to allow the old conditions, then we could ignore shmem_enabled in the non-writable, executable case - otherwise defer to "if the filesystem supports it", where we would then respect huge=.

I think those are the important points. Am I missing anything?
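
As a rough illustration of the proposed range, success, and file-eligibility rules, here is a small Python model. It is not kernel code, only a sketch under stated assumptions: a 2 MiB PMD-sized hugepage (x86-64 with 4 KiB native pages), a range start that is rounded up and a range end that is rounded down to a hugepage boundary within each VMA, and the hypothetical names HPAGE_PMD_SIZE, RegionState, pmd_regions(), collapse_succeeded() and file_collapse_eligible(), none of which come from the kernel.

```python
from enum import Enum

# Assumption: 2 MiB PMD-sized hugepage (x86-64 with 4 KiB native pages).
HPAGE_PMD_SIZE = 2 * 1024 * 1024


class RegionState(Enum):
    """Hypothetical per-region outcome used only by this model."""
    COLLAPSED = "collapsed"                    # collapse succeeded
    ALREADY_PMD_MAPPED = "already-pmd-mapped"  # was already a PMD-mapped THP
    FAILED = "failed"                          # collapse attempted and failed


def pmd_regions(start, end, vmas, hpage=HPAGE_PMD_SIZE):
    """Return start addresses of hugepage-aligned/sized regions in [start, end).

    `vmas` is a list of (vma_start, vma_end) tuples. Each VMA is handled
    independently, so no returned region crosses a VMA boundary.
    """
    regions = []
    for vma_start, vma_end in vmas:
        lo = max(start, vma_start)
        hi = min(end, vma_end)
        if lo >= hi:
            continue  # requested range does not overlap this VMA
        first = (lo + hpage - 1) // hpage * hpage  # round lo up
        last = hi // hpage * hpage                 # round hi down
        regions.extend(range(first, last, hpage))
    return regions


def collapse_succeeded(states):
    """Success only if every region was collapsed or already PMD-mapped."""
    return all(state is not RegionState.FAILED for state in states)


def file_collapse_eligible(open_for_write, mapping_executable,
                           fs_large_folio_support):
    """Model the 'either large-folio filesystem or old conditions' option from (2)."""
    old_conditions = (not open_for_write) and mapping_executable
    return fs_large_folio_support or old_conditions
```

In this model, a range covering two VMAs that meet in the middle of a hugepage-aligned region yields no region for that straddled area, which matches the statement that a hugepage cannot cross a VMA boundary. A single FAILED entry makes collapse_succeeded() return False even when every other region was collapsed.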
Thanks again everyone for taking the time to read and discuss,

Best,
Zach

[1] https://lore.kernel.org/linux-mm/20220624173656.2033256-23-jthoughton@xxxxxxxxxx/
[2] https://lore.kernel.org/linux-mm/YpGbnbi44JqtRg+n@xxxxxxxxxxxxxxxxxxxx/