This series introduces two optimizations in the huge page clearing path:

 1. extend the clear_page() machinery to also handle extents larger
    than a single page.
 2. support non-cached page clearing for huge and gigantic pages.

The first optimization is useful for hugepage fault handling, the
second for prefaulting, or for gigantic pages.

The immediate motivation is to speed up creation of large VMs backed
by huge pages.

Performance
==

VM creation (192GB VM with prealloc'd 2MB backing pages) sees
significant run-time improvements:

 Icelakex:
                           Time (s)           Delta (%)
  clear_page_erms()        22.37 ( +- 0.14s )            #  9.21 bytes/ns
  clear_pages_erms()       16.49 ( +- 0.06s )  -26.28%   # 12.50 bytes/ns
  clear_pages_movnt()       9.42 ( +- 0.20s )  -42.87%   # 21.88 bytes/ns

 Milan:
                           Time (s)           Delta (%)
  clear_page_erms()        16.49 ( +- 0.06s )            # 12.50 bytes/ns
  clear_pages_erms()       11.82 ( +- 0.06s )  -28.32%   # 17.44 bytes/ns
  clear_pages_clzero()      4.91 ( +- 0.27s )  -58.49%   # 41.98 bytes/ns

As a side-effect, non-polluting clearing, by eliding the zero filling
of caches, also shows better LLC miss rates. For a kbuild run
alongside a background page-clearing job, this shows up as a small
(~2%) improvement in runtime.

Discussion
==

With the motivation out of the way, the following note describes how
v3 handles past review comments (and other sticking points for series
of this nature -- especially the non-cached part -- over the years):

1. Non-cached clearing is unnecessary on x86: x86 already uses
   'REP;STOS' which, unlike a MOVNT loop, carries semantically richer
   information that current (and/or future) processors can use to
   make the same cache-elision optimization.

   All true, except a) current-gen uarchs often don't, and b) even
   when they do, the kernel, by clearing at 4K granularity, doesn't
   expose the extent information in a way that processors could
   easily optimize for.

   For a), I tested a bunch of REP-STOSB/MOVNTI/CLZERO loops with
   different chunk sizes (in user-space over a VA extent of 4GB,
   page-size=4K.)
   Intel Icelake (LLC=48MB, no_turbo=1):

      chunk-size    REP-STOSB    MOVNTI
                      MBps        MBps

          4K           9444      24510
         64K          11931      24508
          2M          12355      24524
          8M          12369      24525
         32M          12368      24523
        128M          12374      24522
          1GB         12372      24561

   Which is pretty flat across chunk-sizes.

   AMD Milan (LLC=32MB, boost=0):

      chunk-size    REP-STOSB    MOVNTI     CLZERO
                      MBps        MBps       MBps

          4K          13034      17815      45579
         64K          15196      18549      46038
          2M          14821      18581      39064
          8M          13964      18557      46045
         32M          22525      18560      45969
        128M          29311      18581      38924
          1GB         35807      18574      45981

   The scaling on Milan starts right around chunk=LLC-size. It does
   seem to asymptotically approach CLZERO performance, but the
   scaling is linear, not a step function.

   For b), as mentioned above, the kernel, by zeroing at 4K
   granularity, doesn't send the right signal to the uarch (though
   the largest extent we can use for huge pages is 2MB (and lower for
   preemptible kernels), which from these numbers is not large
   enough.) Still, using clear_page_extent() with larger extents
   would send the uarch a hint that it could capitalize on in the
   future.

   This is addressed in patches 1-6:
     "mm, huge-page: reorder arguments to process_huge_page()"
     "mm, huge-page: refactor process_subpage()"
     "clear_page: add generic clear_user_pages()"
     "mm, clear_huge_page: support clear_user_pages()"
     "mm/huge_page: generalize process_huge_page()"
     "x86/clear_page: add clear_pages()"

   with patch 5, "mm/huge_page: generalize process_huge_page()",
   containing the core logic.

2. Non-caching stores (via MOVNTI, CLZERO on x86) are weakly ordered
   with respect to the cache hierarchy and, unless combined with an
   appropriate fence, are unsafe to use.

   This is true and is a problem. Patch 12, "sparse: add
   address_space __incoherent", adds a new sparse address_space which
   is used in the architectural interfaces to make sure that any user
   is cognizant of its use:

     void clear_user_pages_incoherent(__incoherent void *page, ...)
     void clear_pages_incoherent(__incoherent void *page, ...)
   One other place it is needed (and is missing) is in highmem:

     void clear_user_highpages_incoherent(struct page *page, ...)

   Given the natural highmem interface, I couldn't think of a good
   way to add the annotation here.

3. Non-caching stores are generally slower than cached stores for
   extents smaller than LLC-size, and faster for larger ones.

   This means that choosing the non-caching path for too small an
   extent would cause performance regressions. There is of course
   benefit in not filling the cache with zeroes, but that is a
   somewhat nebulous advantage and AFAICT there are no representative
   tests that probe for it.

   (Note that this slowness isn't a consequence of the extra fence --
   that is expensive but stops being noticeable for chunk-sizes >=
   ~32K-128K depending on uarch.)

   This is handled by adding an arch specific threshold (with a
   default CLEAR_PAGE_NON_CACHING_THRESHOLD=8MB) in patches 15 and 16:
     "mm/clear_page: add clear_page_non_caching_threshold()"
     "x86/clear_page: add arch_clear_page_non_caching_threshold()"

   Further, a single call to clear_huge_pages() or
   get_/pin_user_pages() might only see a small portion of an extent
   being cleared in each iteration. To make sure we choose
   non-caching stores when working with large extents, patch 18,
   "gup: add FOLL_HINT_BULK, FAULT_FLAG_NON_CACHING", adds a new flag
   that gup users can use for this purpose. This is used in patch 20,
   "vfio_iommu_type1: specify FOLL_HINT_BULK to pin_user_pages()",
   when pinning process memory while attaching passthrough PCIe
   devices. The get_user_pages() logic to handle these flags is in
   patch 19, "gup: hint non-caching if clearing large regions".

4. A subpoint of 3) above (non-caching stores are faster for extents
   larger than LLC-size) is generally true, with a side of Brownian
   motion thrown in. For instance, MOVNTI (for > LLC-size) performs
   well on Broadwell and Ice Lake, but not on Skylake/Cascade Lake --
   sandwiched in between the two.
   To deal with this, use Ingo's suggestion of "trust but verify"
   (https://lore.kernel.org/lkml/20201014153127.GB1424414@xxxxxxxxx/):
   enable MOVNT by default and only disable it on slow uarchs.

   If the non-caching path ends up being a part of the kernel, uarchs
   that regress would hopefully show up early enough in chip testing.

   Patch 11, "x86/cpuid: add X86_FEATURE_MOVNT_SLOW", adds this logic
   and patch 21, "x86/cpu/intel: set X86_FEATURE_MOVNT_SLOW for
   Skylake", disables the non-caching path for Skylake.

Performance numbers are in patches 6 and 19, "x86/clear_page: add
clear_pages()" and "gup: hint non-caching if clearing large regions".

Also at: github.com/terminus/linux clear-page-non-caching.upstream-v3

Comments appreciated!

Changelog
==

v2: https://lore.kernel.org/lkml/20211020170305.376118-1-ankur.a.arora@xxxxxxxxxx/
 - Add multi-page clearing: this addresses comments from Ingo (from
   v1), and from an offlist discussion with Linus.
 - Rename clear_pages_uncached() to make the lack of safety more
   obvious: this addresses comments from Andy Lutomirski.
 - Simplify the clear_huge_page() changes.
 - Usual cleanups etc.
 - Rebased to v5.18.

v1: https://lore.kernel.org/lkml/20201014083300.19077-1-ankur.a.arora@xxxxxxxxxx/
 - Make the unsafe nature of clear_page_uncached() more obvious.
 - Invert X86_FEATURE_NT_GOOD to X86_FEATURE_MOVNT_SLOW, so we don't
   have to explicitly enable it for every new model: suggestion from
   Ingo Molnar.
 - Add GUP path (and appropriate threshold) to allow the uncached
   path to be used for huge pages.
 - Make the code more generic so it's tied to fewer x86 specific
   assumptions.
Thanks
Ankur

Ankur Arora (21):
  mm, huge-page: reorder arguments to process_huge_page()
  mm, huge-page: refactor process_subpage()
  clear_page: add generic clear_user_pages()
  mm, clear_huge_page: support clear_user_pages()
  mm/huge_page: generalize process_huge_page()
  x86/clear_page: add clear_pages()
  x86/asm: add memset_movnti()
  perf bench: add memset_movnti()
  x86/asm: add clear_pages_movnt()
  x86/asm: add clear_pages_clzero()
  x86/cpuid: add X86_FEATURE_MOVNT_SLOW
  sparse: add address_space __incoherent
  clear_page: add generic clear_user_pages_incoherent()
  x86/clear_page: add clear_pages_incoherent()
  mm/clear_page: add clear_page_non_caching_threshold()
  x86/clear_page: add arch_clear_page_non_caching_threshold()
  clear_huge_page: use non-cached clearing
  gup: add FOLL_HINT_BULK, FAULT_FLAG_NON_CACHING
  gup: hint non-caching if clearing large regions
  vfio_iommu_type1: specify FOLL_HINT_BULK to pin_user_pages()
  x86/cpu/intel: set X86_FEATURE_MOVNT_SLOW for Skylake

 arch/alpha/include/asm/page.h                |   1 +
 arch/arc/include/asm/page.h                  |   1 +
 arch/arm/include/asm/page.h                  |   1 +
 arch/arm64/include/asm/page.h                |   1 +
 arch/csky/include/asm/page.h                 |   1 +
 arch/hexagon/include/asm/page.h              |   1 +
 arch/ia64/include/asm/page.h                 |   1 +
 arch/m68k/include/asm/page.h                 |   1 +
 arch/microblaze/include/asm/page.h           |   1 +
 arch/mips/include/asm/page.h                 |   1 +
 arch/nios2/include/asm/page.h                |   2 +
 arch/openrisc/include/asm/page.h             |   1 +
 arch/parisc/include/asm/page.h               |   1 +
 arch/powerpc/include/asm/page.h              |   1 +
 arch/riscv/include/asm/page.h                |   1 +
 arch/s390/include/asm/page.h                 |   1 +
 arch/sh/include/asm/page.h                   |   1 +
 arch/sparc/include/asm/page_32.h             |   1 +
 arch/sparc/include/asm/page_64.h             |   1 +
 arch/um/include/asm/page.h                   |   1 +
 arch/x86/include/asm/cacheinfo.h             |   1 +
 arch/x86/include/asm/cpufeatures.h           |   1 +
 arch/x86/include/asm/page.h                  |  26 ++
 arch/x86/include/asm/page_64.h               |  64 ++++-
 arch/x86/kernel/cpu/amd.c                    |   2 +
 arch/x86/kernel/cpu/bugs.c                   |  30 +++
 arch/x86/kernel/cpu/cacheinfo.c              |  13 +
 arch/x86/kernel/cpu/cpu.h                    |   2 +
 arch/x86/kernel/cpu/intel.c                  |   2 +
 arch/x86/kernel/setup.c                      |   6 +
 arch/x86/lib/clear_page_64.S                 |  78 ++++--
 arch/x86/lib/memset_64.S                     |  68 ++---
 arch/xtensa/include/asm/page.h               |   1 +
 drivers/vfio/vfio_iommu_type1.c              |   3 +
 fs/hugetlbfs/inode.c                         |   7 +-
 include/asm-generic/clear_page.h             |  69 +++++
 include/asm-generic/page.h                   |   1 +
 include/linux/compiler_types.h               |   2 +
 include/linux/highmem.h                      |  46 ++++
 include/linux/mm.h                           |  10 +-
 include/linux/mm_types.h                     |   2 +
 mm/gup.c                                     |  18 ++
 mm/huge_memory.c                             |   3 +-
 mm/hugetlb.c                                 |  10 +-
 mm/memory.c                                  | 264 +++++++++++++++----
 tools/arch/x86/lib/memset_64.S               |  68 ++---
 tools/perf/bench/mem-memset-x86-64-asm-def.h |   6 +-
 47 files changed, 680 insertions(+), 144 deletions(-)
 create mode 100644 include/asm-generic/clear_page.h

--
2.31.1