Hey Andrew, This v4 of the series supersedes v3 currently in mm-unstable, and has the following dependencies, which should be applied in order (with v3 dropped): (1) Patch "mm/khugepaged: check compound_order() in collapse_pte_mapped_thp()" Link: https://lore.kernel.org/linux-mm/20220922222731.1124481-1-zokeefe@xxxxxxxxxx/ (2) Series (2) "mm/madvise: MADV_COLLAPSE return EAGAIN when page cannot be isolated" Link: https://lore.kernel.org/linux-mm/20220922184651.1016461-1-zokeefe@xxxxxxxxxx/ Apologies for the all the patch manipulation trouble. Please consider taking them together into mm-unstable. Best, Zach -------------------------------- v4 Forward This version provides some small cleanups: documentation, comments, readability refactorings. While most of the changes are small, there were enough warrant a new version. See the v3 -> v4 log below for detailed changes. The only change in kernel behavior, as suggested by Yang Shi, is that in khugepaged_add_pte_mapped_thp(), when adding an address to a mm_struct's khugepaged_mm_slot ->pte_mapped_thp[] array, don't check if the address already exists. While there is a possible race between khugepaged and MADV_COLLAPSE that exists that would result in a "multiple-add", it is quite rare and the cost of said "multiple-add" is likely cheaper then preventing the "multiple-add" in the first place (plus, it also simplifies the code)[1]. -------------------------------- v3 Forward This version cleans up a few small issues in v2, expands selftest coverage, rebases on some recent khugepaged changes and adds more details to commit descriptions to help with review. The three main cleanups made are: (1) Patch 2: In hpage_collapse_scan_file() and collapse_file(), don't use then xa_state.xa_index to determine if the HPAGE_PMD_ORDER THP is properly aligned. Instead, check the compound_head(page)->index. Not only is it better to not rely on internal data in struct xa_state (as the comments above said struct definition ask), but it is slightly more accurate / future proof in case we encounter an unaligned compound page of order HPAGE_PMD_ORDER (AFAIK not possible today). Moreover, especially for hpage_collapse_scan_file() where the RCU lock might be dropped as we traverse the XArray, we want to be checking the compound_head(), since otherwise we might erroneously be looking at a tail page if a collapse happened from under us. (2) Patch 2: When hpage_collapse_scan_file() returns SCAN_PTE_MAPPED_HUGEPAGE in the khugepaged path, check the pmd maps a pte table before adding the mm/address to the deferred collapse array. The reason is: we will grab mmap_lock in write every time we attempt collapse_pte_mapped_thp(), so we should try to avoid this if possible. This also prevents khugepaged from repeatedly adding the same mm/address pair to the deferred collapse array after the page cache has already been updated with the new hugepage, but before the memory has been refaulted. (3) Patch 3: In find_pmd_thp_or_none(), check pmd_none() instead of !pmd_present() when detecting pmds that have been cleared. The reason this check exists is because MADV_COLLAPSE might be operating on memory which was already collapsed by khugepaged, but before the memory had been refaulted. In this case, khugepaged cleared the pmd, and so the correct pmd entry to look for is the "none" pmd. -------------------------------- v2 Forward Mostly a RESEND: rebase on latest mm-unstable + minor bug fixes from kernel test robot. -------------------------------- This series builds on top of the previous "mm: userspace hugepage collapse" series which introduced the MADV_COLLAPSE madvise mode and added support for private, anonymous mappings[2], by adding support for file and shmem backed memory to CONFIG_READ_ONLY_THP_FOR_FS=y kernels. File and shmem support have been added with effort to align with existing MADV_COLLAPSE semantics and policy decisions[3]. Collapse of shmem-backed memory ignores kernel-guiding directives and heuristics including all sysfs settings (transparent_hugepage/shmem_enabled), and tmpfs huge= mount options (shmem always supports large folios). Like anonymous mappings, on successful return of MADV_COLLAPSE on file/shmem memory, the contents of memory mapped by the addresses provided will be synchronously pmd-mapped THPs. This functionality unlocks two important uses: (1) Immediately back executable text by THPs. Current support provided by CONFIG_READ_ONLY_THP_FOR_FS may take a long time on a large system which might impair services from serving at their full rated load after (re)starting. Tricks like mremap(2)'ing text onto anonymous memory to immediately realize iTLB performance prevents page sharing and demand paging, both of which increase steady state memory footprint. Now, we can have the best of both worlds: Peak upfront performance and lower RAM footprints. (2) userfaultfd-based live migration of virtual machines satisfy UFFD faults by fetching native-sized pages over the network (to avoid latency of transferring an entire hugepage). However, after guest memory has been fully copied to the new host, MADV_COLLAPSE can be used to immediately increase guest performance. khugepaged has received a small improvement by association and can now detect and collapse pte-mapped THPs. However, there is still work to be done along the file collapse path. Compound pages of arbitrary order still needs to be supported and THP collapse needs to be converted to using folios in general. Eventually, we'd like to move away from the read-only and executable-mapped constraints currently imposed on eligible files and support any inode claiming huge folio support. That said, I think the series as-is covers enough to claim that MADV_COLLAPSE supports file/shmem memory. Patches 1-3 Implement the guts of the series. Patch 4 Is a tracepoint for debugging. Patches 5-9 Refactor existing khugepaged selftests to work with new memory types + new collapse tests. Patch 10 Adds a userfaultfd selftest mode to mimic a functional test of UFFDIO_REGISTER_MODE_MINOR+MADV_COLLAPSE live migration. (v4 note: "userfaultfd shmem" selftest is failing as of Sep 22 mm-unstable) Applies against mm-unstable with: - v3 series dropped - Prerequisite patch (v2) "mm/khugepaged: check compound_order() in collapse_pte_mapped_thp()" [4] - Prerequisite series "mm/madvise: MADV_COLLAPSE return EAGAIN when page cannot be isolated" [5] [1] https://lore.kernel.org/linux-mm/YyiK8YvVcrtZo0z3@xxxxxxxxxx/ [2] https://lore.kernel.org/linux-mm/20220706235936.2197195-1-zokeefe@xxxxxxxxxx/ [3] https://lore.kernel.org/linux-mm/YtBmhaiPHUTkJml8@xxxxxxxxxx/ [4] https://lore.kernel.org/linux-mm/20220922222731.1124481-1-zokeefe@xxxxxxxxxx/ [5] https://lore.kernel.org/linux-mm/20220922184651.1016461-1-zokeefe@xxxxxxxxxx/ Previous versions: v1: https://lore.kernel.org/linux-mm/20220812012843.3948330-1-zokeefe@xxxxxxxxxx/ v2: https://lore.kernel.org/linux-mm/20220826220329.1495407-1-zokeefe@xxxxxxxxxx/ v3: https://lore.kernel.org/linux-mm/20220907144521.3115321-1-zokeefe@xxxxxxxxxx/ v3 -> v4: - [Yang Shi] ("mm/shmem: add flag to enforce shmem THP in hugepage_vma_check()). Explicitly test shmem_huge_force argument in shmem_is_huge() to make it obvious it ignores sysfs settings. - [Yang Shi] ("mm/khugepaged: attempt to map file/shmem-backed pte-mapped THPs by pmds") Fix misspelling in commit description. - [Yang Shi] ("mm/khugepaged: attempt to map file/shmem-backed") Corrected comment in collapse_pte_mapped_thp() in to make it accurate what we are checking in fast-path. - [Yang Shi] ("mm/khugepaged: attempt to map file/shmem-backed pte-mapped THPs by pmds") Add note in Documentation/admin-guide/mm/transhuge.rst detailing changes to khugepaged/pages_collapsed counter. - [Yang Shi] ("mm/madvise: add file and shmem support to MADV_COLLAPSE") Removed check for "multi-add" in khugepaged_add_pte_mapped_thp() and added comment detailing the possible race, and why this check wasn't necessary. - [Yang Shi] ("mm/madvise: add file and shmem support to MADV_COLLAPSE") Deleted hugepage_vma_revalidate_anon() and added argument to hugepage_vma_revalidate() specifying if we are expecting an anonymous VMA. v2 -> v3: - The 3 changes mentioned in the v3 Forward - Drop redundant PageTransCompound() check in collapse_pte_mapped_thp() in "mm/madvise: add file and shmem support to MADV_COLLAPSE" (it is covered by PageHead() and hugepage_vma_check() for !HugeTLB. - In "selftests/vm: add thp collapse file and tmpfs testing", don't assume path used for file collapse testing will be on /dev/sda - instead, use the major/minor device numbers returned from stat(2) to traverse sysfs and find the correct block device. Also only do stat() statfs() checks on user-supplied test directory once (instead of every time we create a test file). - Added "selftests/vm: add file/shmem MADV_COLLAPSE selftest for cleared pmd" which tests a common case of MADV_COLLAPSE applied to file/shmem memory that has been "collapsed" (in the page cache) by khugepaged, but not yet refaulted by the process. v1 -> v2: - Add missing definition for khugepaged_add_pte_mapped_thp() in !CONFIG_SHEM builds, in "mm/khugepaged: attempt to map file/shmem-backed pte-mapped THPs by pmds" - Minor bugfixes in "mm/madvise: add file and shmem support to MADV_COLLAPSE" for !CONFIG_SHMEM, !CONFIG_TRANSPARENT_HUGEPAGE and some compiler settings. - Rebased on latest mm-unstable Zach O'Keefe (10): mm/shmem: add flag to enforce shmem THP in hugepage_vma_check() mm/khugepaged: attempt to map file/shmem-backed pte-mapped THPs by pmds mm/madvise: add file and shmem support to MADV_COLLAPSE mm/khugepaged: add tracepoint to hpage_collapse_scan_file() selftests/vm: dedup THP helpers selftests/vm: modularize thp collapse memory operations selftests/vm: add thp collapse file and tmpfs testing selftests/vm: add thp collapse shmem testing selftests/vm: add file/shmem MADV_COLLAPSE selftest for cleared pmd selftests/vm: add selftest for MADV_COLLAPSE of uffd-minor memory Documentation/admin-guide/mm/transhuge.rst | 9 +- include/linux/khugepaged.h | 13 +- include/linux/shmem_fs.h | 10 +- include/trace/events/huge_memory.h | 36 + kernel/events/uprobes.c | 2 +- mm/huge_memory.c | 2 +- mm/khugepaged.c | 311 ++++-- mm/shmem.c | 18 +- tools/testing/selftests/vm/Makefile | 2 + tools/testing/selftests/vm/khugepaged.c | 913 +++++++++++++----- tools/testing/selftests/vm/soft-dirty.c | 2 +- .../selftests/vm/split_huge_page_test.c | 12 +- tools/testing/selftests/vm/userfaultfd.c | 171 +++- tools/testing/selftests/vm/vm_util.c | 36 +- tools/testing/selftests/vm/vm_util.h | 5 +- 15 files changed, 1166 insertions(+), 376 deletions(-) -- 2.37.3.998.g577e59143f-goog