This series introduces multi-page clearing for hugepages.

This is a follow-up to some of the ideas discussed at:
  https://lore.kernel.org/lkml/CAHk-=wj9En-BC4t7J9xFZOws5ShwaR9yor7FxHZr8CTVyEP_+Q@xxxxxxxxxxxxxx/

On x86, page clearing is typically done via string instructions. These,
unlike a MOV loop, allow us to explicitly advertise the region-size to
the processor, which could serve as a hint to current (and/or future)
uarchs to elide cacheline allocation.

In current generation processors, Milan (and presumably other Zen
variants) uses the hint to elide cacheline allocation (for
region-size > LLC-size.)

An additional reason for doing this is that string instructions are
typically microcoded, and clearing in bigger chunks than the current
page-at-a-time logic amortizes some of the cost.

All uarchs tested (Milan, Icelakex, Skylakex) showed improved
performance.

There are, however, some problems:

1. Extended zeroing periods mean increased latency due to the now
   missing preemption points.

   That's handled in patches 7, 8, 9:
     "sched: define TIF_ALLOW_RESCHED"
     "irqentry: define irqentry_exit_allow_resched()"
     "x86/clear_huge_page: make clear_contig_region() preemptible"
   by the context marking itself reschedulable, and rescheduling in
   irqexit context if needed (for PREEMPT_NONE/_VOLUNTARY.)

2. The current page-at-a-time clearing logic does left-right narrowing
   towards the faulting page, which maintains cache locality for
   workloads that have a sequential access pattern. Clearing in large
   chunks loses that.

   Some (but not all) of that could be ameliorated by something like
   this patch:
     https://lore.kernel.org/lkml/20220606203725.1313715-1-ankur.a.arora@xxxxxxxxxx/

   But before doing that, I'd like some comments on whether it is
   worth doing for this specific use case.
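To make the chunking idea above concrete, here is a minimal user-space
sketch. It is not code from the series: clear_region_chunked() and
SKETCH_PAGE_SIZE are hypothetical names, and memset() only stands in for
the string-instruction based clearing described above. It shows how
clearing in larger extents reduces the number of clearing calls, which is
where the per-call (microcode) cost would be amortized.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Assumed base page size for this sketch only. */
#define SKETCH_PAGE_SIZE 4096UL

/*
 * Hypothetical user-space analogue of chunked clearing: zero npages
 * pages starting at base, at most chunk_pages pages per call.
 * Returns the number of clearing calls issued.
 *
 * memset() is a stand-in for the kernel's clearing primitive; the
 * point is only that each call sees a larger region size.
 */
static size_t clear_region_chunked(unsigned char *base, size_t npages,
                                   size_t chunk_pages)
{
    size_t done = 0;
    size_t calls = 0;

    assert(base != NULL);

    /* Treat a zero chunk size as page-at-a-time. */
    if (chunk_pages == 0)
        chunk_pages = 1;

    while (done < npages) {
        size_t n = npages - done;

        /* Clamp the final chunk to what is left of the region. */
        if (n > chunk_pages)
            n = chunk_pages;

        memset(base + done * SKETCH_PAGE_SIZE, 0, n * SKETCH_PAGE_SIZE);
        done += n;
        calls++;
    }

    return calls;
}
```

With chunk_pages == 1 this degenerates to page-at-a-time clearing (one
call per page). A larger chunk_pages issues fewer, longer clears, but
also lengthens the stretch between points where the caller regains
control, which is the latency concern raised in problem 1.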
Rest of the series:

Patches 1, 2, 3:
  "huge_pages: get rid of process_huge_page()"
  "huge_page: get rid of {clear,copy}_subpage()"
  "huge_page: allow arch override for clear/copy_huge_page()"
are mechanical and they simplify some of the current clear_huge_page()
logic.

Patches 4, 5:
  "x86/clear_page: parameterize clear_page*() to specify length"
  "x86/clear_pages: add clear_pages()"
add clear_pages() and helpers.

Patch 6:
  "mm/clear_huge_page: use multi-page clearing"
adds the chunked x86 clear_huge_page() implementation.

Performance
==

Demand fault performance gets a decent boost:

  *Icelakex*   mm/clear_huge_page   x86/clear_huge_page    change
                     (GB/s)               (GB/s)

  pg-sz=2MB           8.76                11.82           +34.93%
  pg-sz=1GB           8.99                12.18           +35.48%

  *Milan*      mm/clear_huge_page   x86/clear_huge_page    change
                     (GB/s)               (GB/s)

  pg-sz=2MB          12.24                17.54           +43.30%
  pg-sz=1GB          17.98                37.24          +107.11%

vm-scalability/case-anon-w-seq-hugetlb, gains in stime but performs
worse when user space tries to touch those pages:

  *Icelakex*                  mm/clear_huge_page   x86/clear_huge_page    change
  (mem=4GB/task, tasks=128)

  stime                        293.02 +-  .49%      239.39 +-  .83%      -18.30%
  utime                        440.11 +-  .28%      508.74 +-  .60%      +15.59%
  wall-clock                     5.96 +-  .33%        6.27 +- 2.23%      + 5.20%

  *Milan*                     mm/clear_huge_page   x86/clear_huge_page    change
  (mem=1GB/task, tasks=512)

  stime                        490.95 +- 3.55%      466.90 +- 4.79%      - 4.89%
  utime                        276.43 +- 2.85%      311.97 +- 5.15%      +12.85%
  wall-clock                     3.74 +- 6.41%        3.58 +- 7.82%      - 4.27%

Also at:
  github.com/terminus/linux clear-pages.v1

Comments appreciated!
Ankur Arora (9):
  huge_pages: get rid of process_huge_page()
  huge_page: get rid of {clear,copy}_subpage()
  huge_page: allow arch override for clear/copy_huge_page()
  x86/clear_page: parameterize clear_page*() to specify length
  x86/clear_pages: add clear_pages()
  mm/clear_huge_page: use multi-page clearing
  sched: define TIF_ALLOW_RESCHED
  irqentry: define irqentry_exit_allow_resched()
  x86/clear_huge_page: make clear_contig_region() preemptible

 arch/x86/include/asm/page.h        |   6 +
 arch/x86/include/asm/page_32.h     |   6 +
 arch/x86/include/asm/page_64.h     |  25 +++--
 arch/x86/include/asm/thread_info.h |   2 +
 arch/x86/lib/clear_page_64.S       |  45 ++++++--
 arch/x86/mm/hugetlbpage.c          |  59 ++++++++++
 include/linux/sched.h              |  29 +++++
 kernel/entry/common.c              |   8 ++
 kernel/sched/core.c                |  36 +++---
 mm/memory.c                        | 174 +++++++++++++++--------------
 10 files changed, 270 insertions(+), 120 deletions(-)

--
2.31.1